Chapter 3. Data Manipulation with Pandas
In the previous chapter, we dove into detail on NumPy and its ndarray
object, which provides efficient storage and manipulation of dense typed
arrays in Python. Here we’ll build on this knowledge by looking in
detail at the data structures provided by the Pandas library. Pandas is
a newer package built on top of NumPy, and provides an efficient
implementation of a DataFrame. DataFrames are essentially
multidimensional arrays with attached row and column labels, and often
with heterogeneous types and/or missing data. As well as offering a
convenient storage interface for labeled data, Pandas implements a
number of powerful data operations familiar to users of both database
frameworks and spreadsheet programs.
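To make the description above concrete, here is a minimal sketch of a DataFrame with row and column labels, mixed column types, and a missing value. The column names, row labels, and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: names and values are made up for illustration.
df = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma"],  # string column
        "value": [1.5, np.nan, 3.0],         # float column with one missing entry
        "flag": [True, False, True],         # boolean column
    },
    index=["a", "b", "c"],                   # explicit row labels
)

# Select a single cell by row label and column label, not by integer position.
beta_name = df.loc["b", "name"]

# Count the missing entries in the "value" column.
missing_count = int(df["value"].isna().sum())

print(df)
print(df.dtypes)  # each column keeps its own dtype
```

Each column holds its own type, and the missing entry is stored as NaN rather than forcing the whole table into a single dtype.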
As we saw, NumPy’s ndarray data structure provides essential features
for the type of clean, well-organized data typically seen in numerical
computing tasks. While it serves this purpose very well, its limitations
become clear when we need more flexibility (attaching labels to
data, working with missing data, etc.) and when attempting operations
that do not map well to element-wise broadcasting (groupings,
pivots, etc.), each of which is an important piece of analyzing the less
structured data available in many forms in the world around us. Pandas,
and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s ...
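As a rough illustration of the groupings and pivots mentioned above, the following sketch uses a small hypothetical table. The column names ("group", "kind", "amount") and the values are assumptions made for the example, not data from the text.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
records = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "kind": ["x", "y", "x", "y"],
        "amount": [1.0, 2.0, 3.0, 4.0],
    }
)

# Grouping: total amount per group label, returned as a labeled Series.
totals = records.groupby("group")["amount"].sum()

# Pivot: one row per group, one column per kind, summing amounts in each cell.
table = records.pivot_table(
    index="group", columns="kind", values="amount", aggfunc="sum"
)

print(totals)
print(table)
```

Both operations are keyed on labels rather than array positions, which is the kind of task that does not map naturally onto element-wise broadcasting.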