Chapter 2. Transformations in Action
In this chapter, we will explore the most important Spark transformations (mappers and reducers) in the context of data summarization design patterns, and examine how to select specific transformations for targeted problems.
As you will see, for a given problem (we’ll use the DNA base count problem here) there are multiple possible PySpark solutions using different Spark transformations, but the efficiency of these transformations differs due to their implementation and shuffle processes (when the grouping of values by key happens). The DNA base count problem is very similar to the classic word count problem (finding the frequency of unique words in a set of files/documents), with the difference that in DNA base counting you find the frequencies of DNA letters (A, T, C, G).
I chose this problem because in solving it we will learn about data summarization, condensing a large quantity of information (here, DNA data strings/sequences) into a much smaller set of useful information (the frequency of DNA letters).
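To make the idea of summarization concrete, here is a minimal sketch in plain Python, with no Spark cluster required. It mimics the shape of a map step followed by a reduce-by-key step, which in PySpark would roughly correspond to the RDD methods `flatMap` and `reduceByKey`. The function names `map_bases` and `reduce_by_key` and the three input strings are hypothetical, chosen only for illustration; this is not one of the chapter's solutions.

```python
from collections import defaultdict


def map_bases(sequence):
    """Mapper (hypothetical helper): emit a (base, 1) pair for every letter."""
    return [(base, 1) for base in sequence.upper()]


def reduce_by_key(pairs):
    """Reducer (hypothetical helper): sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Assumed sample input: three short DNA strings, not real data.
sequences = ["ATCG", "GGCA", "TTAA"]

# Map every sequence to (base, 1) pairs, then flatten them into one list.
pairs = [pair for seq in sequences for pair in map_bases(seq)]

# Condense 12 letters into at most four (base, frequency) entries.
base_counts = reduce_by_key(pairs)
print(base_counts)
```

The input holds 12 letters, while the summary holds only four entries, one per base. On real DNA data the input may be very large, yet the summary stays the same size.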
This chapter provides three complete end-to-end solutions in PySpark, using different mappers and reductions to solve the DNA base count problem. We’ll discuss the performance differences between them, and explore data summarization design patterns.
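One reason different mappers can perform differently is the number of intermediate (key, value) pairs they emit before values are grouped by key. The following plain Python sketch compares two hypothetical mappers, `per_base_pairs` and `per_sequence_pairs`, that produce the same final counts. It is an assumption-laden illustration of the general idea, not necessarily one of the three solutions the chapter presents, and it measures pair counts only, not actual Spark run time.

```python
from collections import Counter


def per_base_pairs(sequence):
    """Hypothetical mapper 1: one (base, 1) pair per letter."""
    return [(base, 1) for base in sequence]


def per_sequence_pairs(sequence):
    """Hypothetical mapper 2: pre-aggregate, one (base, count) pair per distinct base."""
    return list(Counter(sequence).items())


def sum_by_key(pairs):
    """Sum values per key, standing in for a reduce-by-key step."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Assumed sample input, not real data.
sequences = ["AAAATTTT", "CCCCGGGG", "AAAACCCC"]

naive = [p for s in sequences for p in per_base_pairs(s)]
combined = [p for s in sequences for p in per_sequence_pairs(s)]

# Both strategies give the same summary, but emit different numbers of pairs.
print(len(naive), len(combined))
print(sum_by_key(naive) == sum_by_key(combined))
```

In a distributed setting, fewer intermediate pairs generally means less data moved during the shuffle, which is the kind of difference the performance discussion in this chapter is concerned with.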
The DNA Base Count Example
The purpose of our example in this chapter is to count DNA bases in ...