Chapter 2. Introduction to Data Analysis with PySpark
Python is the most widely used language for data science tasks. The prospect of doing statistical computing and web programming in the same language contributed to its rise in popularity in the early 2010s, and it has led to a thriving ecosystem of tools and a helpful community for data analysis, often referred to as the PyData ecosystem. This is a big reason for PySpark’s popularity. Being able to leverage distributed computing via Spark from Python makes data science practitioners more productive, because they are already familiar with the language and can draw on its wide community. For that same reason, we have opted to write our examples in PySpark.
It’s difficult to express how transformative it is to do all of your data munging and analysis in a single environment, regardless of where the data itself is stored and processed. It’s the sort of thing you have to experience to understand, and we wanted our examples to capture some of the magic we felt when we first started using PySpark. For example, PySpark interoperates with pandas, one of the most popular PyData tools. We will explore this interoperability later in this chapter.
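To give a taste of that interoperability, here is a minimal sketch that moves data between a local pandas DataFrame and a distributed Spark DataFrame. The column names and the aggregation are illustrative only:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-interop").getOrCreate()

# Build a small pandas DataFrame locally (hypothetical columns).
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# Convert it to a distributed Spark DataFrame.
sdf = spark.createDataFrame(pdf)

# Run a distributed aggregation, then bring the (small) result
# back to the driver as a pandas DataFrame.
result = sdf.groupBy().avg("value").toPandas()
print(result)
```

Note that toPandas() collects the entire result into the driver’s memory, so it should be reserved for data that is small enough to fit on a single machine.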
In this chapter, we will explore PySpark’s powerful DataFrame API via a data cleansing exercise. In PySpark, the DataFrame is an abstraction for datasets that have a regular structure, in which each record is a row made up of a set of columns, and each column has a well-defined data type.
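To ground that terminology before we dive in, here is a minimal sketch of loading a dataset into a DataFrame and inspecting its structure. The file path is hypothetical, and the options shown are just one common way to read a CSV file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro").getOrCreate()

# Read a CSV file into a DataFrame, treating the first line as a
# header and letting Spark sample the data to infer column types.
df = spark.read.csv("data/records.csv", header=True, inferSchema=True)

df.printSchema()   # the columns and their inferred data types
df.show(5)         # the first five rows
print(df.count())  # the number of records in the dataset
```

With inferSchema enabled, Spark makes an extra pass over the data to assign a type to each column; for large datasets it is often better to declare a schema explicitly.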