Chapter 2. Secondary Sort: A Detailed Example
The MapReduce framework sorts input to reducers by key, but values of reducers are arbitrarily ordered. This means that if all mappers generate the following key-value pairs for key
= K
:
(K, V1), (K, V2), ..., (K, Vn)
then all these values {V1
, V2
, ..., Vn
} will be processed by a single reducer (for key
= K
), but there will be no order (ascending or descending) between instances of Vi
. As you learned in Chapter 1, Secondary Sort is a design pattern we can use to apply an order (such as “ascending sort” or “descending sort”) to the values. How do we accomplish this? Say we want to apply some order to the reducer values:
S1 ≤ S2 ≤ ... ≤ Sn
or:
S1 ≥ S2 ≥ ... ≥ Sn
where Si
∈ {V1, V2, ..., Vn}
for i = {1, 2, ..., n}
. Note that each Vi
might be a simple data type, such as String
or Integer
, or a tuple (more than a single value—that is, a composite object).
There are two ways to sort reducer values:
- Solution #1
Buffer reducer values in memory, then sort. If the number of reducer values is small enough to fit in memory (per reducer), then this solution will work. But if the number of reducer values is high, then they might not fit in memory (not a preferable solution). Implementation of this solution is simple; it is presented in Chapter 1 and will not be discussed in this chapter.
- Solution #2
Use the Secondary Sort design pattern of the MapReduce framework, and reducer values will arrive sorted (i.e., there’s no need to sort values ...
Get Data Algorithms now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.