Chapter 2. Transformations in Action

In this chapter, we will explore the most important Spark transformations (mappers and reducers) in the context of data summarization design patterns, and examine how to choose the right transformation for a given problem.

As you will see, a given problem (we’ll use the DNA base count problem here) can have several PySpark solutions that use different Spark transformations. These solutions differ in efficiency because the transformations are implemented differently and shuffle data differently (the shuffle is the step in which values are grouped by key). The DNA base count problem closely resembles the classic word count problem (finding the frequency of each unique word in a set of files or documents); the difference is that in DNA base counting you find the frequencies of the DNA letters (A, T, C, G).
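To make the efficiency point concrete, here is a minimal sketch in plain Python (not the Spark API) that imitates a map step followed by a reduce-by-key step in two different ways. The sample `sequences`, the helper `reduce_by_key`, and both approaches are assumptions made for illustration; they are not necessarily the solutions developed later in this chapter. The idea is that both approaches produce the same final counts, but the second emits fewer intermediate (key, value) pairs, which in a distributed setting would mean less data to shuffle.

```python
from collections import Counter, defaultdict

# Hypothetical sample input: each string stands in for one DNA sequence record.
sequences = ["ATCGGA", "TTACG", "GGCA"]


def reduce_by_key(pairs):
    """Sum the values of all (key, value) pairs that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Approach 1: the mapper emits one (base, 1) pair for every letter it sees.
pairs_per_letter = [(base, 1) for seq in sequences for base in seq]

# Approach 2: the mapper counts within each record first, then emits
# one (base, count) pair per distinct base in that record.
pairs_per_record = [pair for seq in sequences for pair in Counter(seq).items()]

counts_1 = reduce_by_key(pairs_per_letter)
counts_2 = reduce_by_key(pairs_per_record)

print(counts_1 == counts_2)                          # same final frequencies
print(len(pairs_per_letter), len(pairs_per_record))  # intermediate pair counts
```

On real DNA records, which are far longer than these toy strings, the gap between the two intermediate pair counts would be much larger, since the second approach emits at most one pair per distinct base per record.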

I chose this problem because solving it teaches data summarization: condensing a large quantity of information (here, DNA data strings/sequences) into a much smaller set of useful information (the frequency of DNA letters).

This chapter provides three complete end-to-end solutions in PySpark, using different mappers and reducers to solve the DNA base count problem. We’ll discuss the performance differences between them, and explore data summarization design patterns.

The DNA Base Count Example

The purpose of our example in this chapter is to count DNA bases in ...
