Data Indexing and Selection
Python Data Science Handbook: Early Release
In the previous chapter, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays. These included indexing (e.g. arr[2, 1]), slicing (e.g. arr[:, 1:5]), masking (e.g. arr[arr > 0]), fancy indexing (e.g. arr[0, [1, 5]]), and combinations thereof (e.g. arr[:, [1, 5]]). Here we’ll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. If you have used the NumPy patterns mentioned above, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
We’ll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.
Data Selection in Series
As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data['b']
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
'a' in data
data.keys()
list(data.items())
Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e'] = 1.25
data
This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.
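The dictionary-like behavior described above can be checked in one short, self-contained session. This is a minimal sketch that rebuilds the same Series; the names has_a, has_value, keys, and pairs are illustrative and do not come from the text.

```python
import pandas as pd

# Rebuild the example Series from the text.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels (the "keys"), not the values.
has_a = 'a' in data
has_value = 0.25 in data

# keys() returns the index; items() yields (label, value) pairs.
keys = list(data.keys())
pairs = list(data.items())

# Assigning to a label that does not exist yet adds a new entry.
data['e'] = 1.25
print(data)
```

Note that the membership test mirrors a dictionary: it asks whether a label is present, so a number that appears only among the values is not found.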
Series as 1D Array
A Series builds on this dictionary-like interface and provides array-style item selection via slices, masking, and fancy indexing, examples of which can be seen below:
# slicing by explicit index
data['a':'c']
# slicing by implicit integer index
data[0:2]
# masking
data[(data > 0.3) & (data < 0.8)]
# fancy indexing
data[['a', 'e']]
Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (i.e. data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e. data[0:2]), the final index is excluded from the slice.
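The difference in endpoint handling is easy to verify directly. The sketch below assumes the five-element Series built earlier in this section; by_label and by_position are hypothetical names used only for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the endpoint 'c' is included.
by_label = data['a':'c']

# Position-based slice: the endpoint 2 is excluded, as in plain Python.
by_position = data[0:2]

print(list(by_label.index))     # labels selected by the label slice
print(list(by_position.index))  # labels selected by the positional slice
```

The label slice returns three entries and the positional slice returns two, even though both are written with the same colon syntax.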
Indexers: loc, iloc, and ix
The slicing and indexing conventions above can be a source of confusion. For example, if your series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# explicit index when indexing
data[1]
# implicit index when slicing
data[1:3]
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes which explicitly access certain indexing schemes. These are not functional methods, but attributes which expose a particular slicing interface to the data in the Series.
First, the loc attribute allows indexing and slicing which always references the explicit index:
data.loc[1]
data.loc[1:3]
The iloc attribute allows indexing and slicing which always references the implicit Python-style index:
data.iloc[1]
data.iloc[1:3]
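Putting the two indexers side by side on the integer-indexed Series makes the contrast concrete. This is a minimal sketch; the four result names are illustrative only.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit labels 1, 3, 5.
label_item = data.loc[1]         # value stored under label 1
label_slice = data.loc[1:3]      # labels 1 through 3, endpoint included

# iloc always uses the implicit positions 0, 1, 2.
position_item = data.iloc[1]     # value at position 1
position_slice = data.iloc[1:3]  # positions 1 and 2, endpoint excluded

print(label_item, list(label_slice.index))
print(position_item, list(position_slice.index))
```

The same bracket contents, 1 and 1:3, select different elements depending on which indexer is used, which is why spelling out loc or iloc removes the ambiguity.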
A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. The purpose of the ix indexer will become more apparent in the context of DataFrame objects, below. Note that ix was deprecated in Pandas 0.20 and removed in Pandas 1.0, so it is available only in older releases.
One guiding principle of Python code (see the Zen of Python, section X.X) is that “explicit is better than implicit”. The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these, both to make code easier to read and understand and to prevent subtle bugs due to the mixed indexing/slicing convention.
Data Selection in DataFrame
Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in many ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.
DataFrame as a Dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let’s return to our example of areas and populations of states:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data
The individual Series which make up the columns of the dataframe can be accessed via dictionary-style indexing of the column name:
data['area']
Equivalently, we can use attribute-style access with column names which are strings:
data.area
This attribute-style column access actually accesses the exact same object as the dictionary-style access:
data.area is data['area']
Though this is a useful shorthand, keep in mind that it does not work for all cases! If the column names are not strings, or if the column names conflict with methods of the dataframe, this attribute-style access is not possible. For example, the DataFrame has a pop method, so data.pop will point to this rather than the "pop" column:
data.pop is data['pop']
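The name clash can be inspected without relying on object identity. The following sketch uses a reduced two-state frame named df (an assumption made for brevity) and checks what each spelling returns.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with anything, so both spellings give the column.
same_values = df.area.equals(df['area'])

# 'pop' clashes with the DataFrame.pop method, so attribute access
# returns the bound method instead of the column.
attr_is_method = callable(df.pop)
item_is_column = isinstance(df['pop'], pd.Series)

print(same_values, attr_is_method, item_is_column)
```

For this reason, dictionary-style access is the safer default whenever a column name might shadow a method.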
As with the Series objects above, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
data['density'] = data['pop'] / data['area']
data
This shows a preview of the straightforward syntax of element-by-element arithmetic between Series objects; we’ll dig into this further in section X.X.
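As a quick check of the element-by-element behavior, the sketch below repeats the column assignment on a two-state subset of the data (the subset is an assumption chosen to keep the example short) and compares one result against the plain Python division.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
data = pd.DataFrame({'area': area, 'pop': pop})

# Division is applied element by element, aligned on the shared index.
data['density'] = data['pop'] / data['area']

# The new column is appended after the existing ones.
print(data)
```

Each density value is simply the population of that row divided by the area of the same row; no explicit loop is needed.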
DataFrame as Two-dimensional Array
As mentioned, we can also view the dataframe as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:
data.values
With this picture in mind, many familiar array-like operations can be done on the dataframe itself. For example, we can transpose the full dataframe to swap rows and columns:
data.transpose()
When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:
data.values[0]
While passing a single “index” to a dataframe accesses a column:
data['area']
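The row-versus-column distinction can be seen by comparing the shapes of the two results. This is a minimal sketch on a three-state frame; the frame contents are taken from the example above, while first_row and area_column are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

# On the raw NumPy array, a single index selects a row.
first_row = data.values[0]

# On the DataFrame, a single key selects a column.
area_column = data['area']

print(first_row.shape)    # one entry per column
print(area_column.shape)  # one entry per row
```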
Thus for array-style indexing, we need another convention. Here Pandas again uses the loc, iloc, and ix indexers mentioned above. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:
data.iloc[:3, :2]
Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:
data.loc[:'Illinois', :'pop']
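To compare the two indexers on the same block of data, the sketch below builds the frame with an explicitly stated row order (an assumption, since the displayed order can differ between Pandas versions) and selects the first three rows and first two columns both ways.

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)
data['density'] = data['pop'] / data['area']

# iloc: first three rows and first two columns, by position.
by_position = data.iloc[:3, :2]

# loc: rows up to and including 'New York',
# columns up to and including 'pop'.
by_label = data.loc[:'New York', :'pop']

# With this row order the two selections pick out the same block.
print(by_position.equals(by_label))
```

The positional slice stops before position 3, whereas the label slice includes its endpoint, so the label that closes the loc slice is the last one kept.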
The ix indexer allows a hybrid of these two approaches:
data.ix[:3, :'pop']
Keep in mind that for integer indices, the ix indexer is subject to the same potential sources of confusion as discussed for integer-indexed Series objects above.
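On Pandas releases where ix is no longer available (it was removed in version 1.0), the hybrid selection has to be written with loc and iloc. The sketch below shows two possible ways to do this; it is one reasonable translation rather than an official replacement, and the row order of the frame is an assumption made for illustration.

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)
data['density'] = data['pop'] / data['area']

# Rows by position, then columns by label: the selection that
# data.ix[:3, :'pop'] is meant to express.
hybrid = data.iloc[:3].loc[:, :'pop']

# Alternative: translate the positions into labels first, then use loc only.
hybrid_alt = data.loc[data.index[:3], :'pop']

print(hybrid)
```

Both forms keep the positional and label-based parts visibly separate, which avoids the ambiguity that ix had with integer indices.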
Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the loc indexer we can combine masking and fancy indexing as in the following:
data.loc[data.density > 100, ['pop', 'density']]
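A self-contained version of this combined selection is sketched below. The Boolean mask chooses the rows and the list of names chooses the columns; the variable dense and the explicit row order are assumptions made for illustration.

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)
data['density'] = data['pop'] / data['area']

# Rows: only states whose density exceeds 100 (masking).
# Columns: only 'pop' and 'density', in that order (fancy indexing).
dense = data.loc[data['density'] > 100, ['pop', 'density']]

print(dense)
```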
Keep in mind also that any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be used to from NumPy:
data.iloc[0, 2] = 90
data
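The assignment can be confirmed by reading the value back through the label-based indexer. This is a minimal sketch on a two-state frame (the subset and the name texas_before are assumptions); it also checks that the neighboring row is left unchanged.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

texas_before = data.loc['Texas', 'density']

# Set row 0, column 2 (California's density) by position.
data.iloc[0, 2] = 90

# Read the same cell back by label.
print(data.loc['California', 'density'])
```

The label-based spelling data.loc['California', 'density'] = 90 would modify the same cell.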
To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple DataFrame and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.
Additional Indexing Conventions
There are a couple of extra indexing conventions which might seem a bit inconsistent with the above discussion, but nevertheless can be very useful in practice. First, while direct integer indices are not allowed on DataFrames, direct integer slices are allowed, and are taken on rows rather than on columns as you might expect:
data[1:3]
Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
data[data.density > 100]
These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the above conventions they are nevertheless quite useful in practice.
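Both shortcuts have explicit equivalents using the indexers, which the sketch below compares. The row order of the frame and the names rows_by_slice and rows_by_mask are assumptions made for illustration.

```python
import pandas as pd

states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
data = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                     'pop': [38332521, 26448193, 19651127,
                             19552860, 12882135]},
                    index=states)
data['density'] = data['pop'] / data['area']

# A bare integer slice selects rows by position.
rows_by_slice = data[1:3]

# A bare Boolean mask also selects rows.
rows_by_mask = data[data['density'] > 100]

# The same selections written with explicit indexers.
print(rows_by_slice.equals(data.iloc[1:3]))
print(rows_by_mask.equals(data.loc[data['density'] > 100]))
```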
Summary
Here we have discussed the various ways to access and modify values within the basic Pandas data structures. With this, we’re slowly building up our fluency with manipulating and operating on labeled data within Pandas. In the next section, we’ll take this a bit further and begin to examine the types of operations that you can do on Pandas Series and DataFrame objects.