Chapter 5. Advanced Pig Latin
In the previous chapter we worked through the basics of Pig Latin. In this chapter we will plumb its depths, and we will also discuss how Pig handles more complex data flows. Finally, we will look at how to use macros and modules to modularize your scripts.
Advanced Relational Operations
We will now discuss the more advanced Pig Latin operators, as well as additional options for operators that were introduced in the previous chapter.
Advanced Features of foreach
In our introduction to foreach (see “foreach”), we discussed how it could take a list of expressions to output for every record in your data pipeline. Now we will look at ways it can explode the number of records in your pipeline, and also how it can be used to apply a set of operations to each record.
flatten
Sometimes you have data in a bag or a tuple and you want to remove that level of nesting. The baseball data available on GitHub (see “Code Examples in This Book”) can be used as an example. Because a player can play more than one position, position is stored in a bag. This allows us to still have one entry per player in the baseball file. But when you want to switch around your data on the fly and group by a particular position, you need a way to pull those entries out of the bag. To do this, Pig provides the flatten modifier in foreach:
--flatten.pig
players = load 'baseball' as (name:chararray, team:chararray,
            position:bag{t:(p:chararray)}, bat:map[]);
pos     = foreach players generate name, ...
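The script above is cut off in this excerpt, so the following is only a rough model of the idea rather than the book's code. It is a Python sketch (not Pig Latin) of what flattening a bag does to a record: each element of the bag is paired with the record's remaining fields, so one input record can become several output records. The function name flatten_bag and the sample player names are hypothetical, invented purely for illustration.

```python
from collections import defaultdict


def flatten_bag(records, bag_field):
    """Model flattening a bag: emit one output record per bag element.

    Assumption: each record is a dict and the bag is a plain list.
    """
    out = []
    for rec in records:
        # Every field except the bag is copied into each output record.
        rest = {k: v for k, v in rec.items() if k != bag_field}
        for item in rec[bag_field]:
            out.append({**rest, bag_field: item})
    return out


# Hypothetical sample data; the real baseball file is not reproduced here.
players = [
    {"name": "Player A", "position": ["Catcher", "First_baseman"]},
    {"name": "Player B", "position": ["Pitcher"]},
    {"name": "Player C", "position": []},
]

pos = flatten_bag(players, "position")

# With the nesting removed, grouping by a single position is straightforward.
by_pos = defaultdict(list)
for row in pos:
    by_pos[row["position"]].append(row["name"])
```

In this sketch Player A contributes two rows, one per position, while Player C, whose bag is empty, contributes none. That is the sense in which flatten can change the number of records flowing through a pipeline.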