Chapter 13. Use Cases and Programming Examples
In this chapter we will take a look at several comprehensive Pig examples and real-world Pig use cases.
Sparse Tuples
In “Schema Tuple Optimization” we introduced a more compact tuple implementation called the schema tuple. However, if your input data is sparse, a schema tuple is not the most efficient way to represent it. You only need to store the position and value of each nonempty field of the tuple, which is exactly what a sparse tuple does. When the vast majority of fields in the tuple are empty, this data structure can save a lot of space. Sparse tuples are not natively supported by Pig. However, Pig allows users to define custom tuple implementations, so you can implement one yourself. In this section, we will show you how to implement the sparse tuple and use it in Pig.
First, we will need to write a SparseTuple class that implements the Tuple interface. However, implementing all methods of the Tuple interface is tedious. To make it easier, we derive SparseTuple from AbstractTuple, which already implements most common methods. Inside SparseTuple, we create a TreeMap that stores the index and value of each nonempty field. We also keep track of the size of the tuple. With both fields, we have the complete state of the sparse tuple. Here is the data structure along with the getter and setter methods of SparseTuple:
public class SparseTuple extends AbstractTuple {
    Map<Integer, Object> matrix = new TreeMap<Integer, Object ...
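To make the state described above concrete (a sorted map of nonempty fields plus a logical size), here is a minimal standalone sketch. It is an illustration under stated assumptions, not the book's listing: the class name SparseTupleSketch and the storedCount helper are hypothetical, and the class deliberately does not extend Pig's AbstractTuple so that it compiles with the JDK alone. A real SparseTuple would extend AbstractTuple and implement the remaining Tuple methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sparse tuple; NOT derived from Pig's AbstractTuple.
class SparseTupleSketch {
    // Index -> value for nonempty fields only; TreeMap keeps the indices sorted.
    private final Map<Integer, Object> fields = new TreeMap<>();
    // Logical number of fields, counting the empty ones too.
    private int size;

    SparseTupleSketch(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be >= 0");
        }
        this.size = size;
    }

    int size() {
        return size;
    }

    // Returns the stored value, or null for a field that was never set.
    Object get(int index) {
        checkIndex(index);
        return fields.get(index);
    }

    // Storing null clears the entry, so empty fields never occupy space.
    void set(int index, Object value) {
        checkIndex(index);
        if (value == null) {
            fields.remove(index);
        } else {
            fields.put(index, value);
        }
    }

    // Grows the tuple by one field; a null value only bumps the size.
    void append(Object value) {
        if (value != null) {
            fields.put(size, value);
        }
        size++;
    }

    // Number of entries actually held in the map (hypothetical helper).
    int storedCount() {
        return fields.size();
    }

    // Expands to a dense list, filling empty positions with null.
    List<Object> getAll() {
        List<Object> all = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            all.add(fields.get(i));
        }
        return all;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }
}
```

For example, a SparseTupleSketch created with size 1000 and two calls to set holds only two map entries, while size() still reports 1000 and get on any other valid index returns null.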