Tutorial: filtering with pandas
When performing data analysis, it's essential to be able to filter data based on certain criteria. For instance, you might want to explore the GDP of the wealthiest countries or aggregate the rainfall of certain states from 2001-2010. Luckily for us, we can leverage the Python pandas framework to filter data for a very wide range of use cases.
We'll walk through a number of examples that use pandas to filter data. Although we have embedded code snippets and results, please feel free to follow along and run the code yourself using this Deepnote workbook.
A quick intro to pandas
Pandas is a powerful data analysis and manipulation tool built on top of the Python programming language. Pandas makes it easier to explore, clean, and process data using two core data structures: Series and DataFrames:
- Series: one-dimensional labeled homogenous arrays with fixed size and mutable data
- DataFrames: Two-dimensional labeled arrays of potentially heterogenous tabular data with mutable size and mutable data (DataFrames can also be thought of as dict-like containers for Series objects).
Series and DataFrames are both built on top of NumPy arrays. To understand how to filter with pandas, let’s first dive into how filtering works in NumPy.
Filtering with NumPy arrays
Numpy arrays operate like indexed lists. Array indexes enable us to select certain elements based on their ordinal position. For instance,
indexed_array in the below example has five elements, each corresponding to an index value from 0-4.
If we want to select the first, third, and fifth scalar elements, we can do the following:
indexed_array = np.array(['a','b','c','d','e']) indexed_array #returns first element ('a') indexed_array #returns third element ('c') indexed_array #returns last element ('e')
NumPy also makes it possible to select multiple elements from an array by passing in a list of indexes, the result of which is a NumPy array.
It's also possible to filter NumPy arrays by passing in a list of Boolean values corresponding to the array indexes. Note that the dimensions of the Boolean list must match those of the NumPy array. For instance, if we want to select the first, third, and last elements of
indexed_array using Booleans, we can do the following:
Using Python’s list comprehension, we can also apply logical arguments to translate nearly any list into a list of Booleans. We can then use that list of Booleans to filter the indexes of the NumPy array.
NumPy arrays use a mechanism called broadcasting to apply element-wise operations on each element in an array. This allows us to compute Boolean values using simple logical arguments, as can be seen in the following example:
Putting all this together, we can combine Boolean indexing, list comprehension, and broadcasting to filter NumPy arrays in a variety of ways.
Filtering pandas rows
Because pandas data structures use NumPy arrays, we can use many of the techniques covered in the previous section to filter pandas rows. Pandas has additional functionality, like
.query() functions, that provide powerful filtering tools that are not available with vanilla NumPy arrays.
Similar to NumPy arrays, we can use indexing to filter pandas rows. For instance, we can return all rows corresponding to the
'Amazon' indexes using the pandas native
.loc[row_indexes, columns] function.
Or we can return all rows corresponding to numerical indexes
[0:2] using the pandas native
.iloc[numerical_row_indexes, columns] function.
Similar to NumPy arrays, we can filter rows in pandas data structures by passing in a list of Boolean values that correspond one to one with the indexes of each row. Let’s explore some ways to filter rows in pandas using Boolean lists below.
We can create pandas filters by applying element-wise operations on each element in a Series.
List comprehension also enables us to create a list of Booleans to filter pandas rows by iterating over a list and applying logic to each element.
Compound logical statements
We can also apply compound logical statements to describe more complex criteria across multiple Series.
As previously mentioned, pandas provides functionality beyond what is possible with NumPy arrays. For instance, pandas has a native
.query() function that helps us write SQL-like statements to filter DataFrames.
Indexing vs. conditionals vs. querying
With so many ways to filter pandas rows, you may be wondering how to choose which technique to apply. Indexing and conditionals are generally the fastest filtering method while querying usually results in the cleanest code, especially for more complex filtering statements (but the results generally take the longest to execute).
Although we have laid out some straightforward heuristics to help you decide, understanding which technique best fits which scenario will likely come with practice.
Selecting pandas columns
Selecting columns in pandas DataFrames is generally more straightforward than filtering rows. Below we will examine how to filter DataFrame columns based on index, name, and data type.
Selecting using indexes
Similar to pandas rows, DataFrame columns can also be filtered on their indexes using the
.loc function and numerical indexes using the
Selecting using column names
Pandas DataFrames also make it easy to select columns by passing in a list of column names.
Selecting using data types
Each column in a pandas DataFrame has an assigned data type (dtype). DataFrames columns can also be filtered by selecting only columns that match certain dtypes.
Common scenarios for using pandas filters
There are limitless ways we can apply these techniques to filter pandas data structures. To help reinforce some of these concepts, we'll go through some of the most common scenarios for filtering pandas rows.
Below we have a DataFrame called
blizzards containing the names, dates, snowfall (in centimeters), and locations of the biggest blizzards in US history. We'll use this DataFrame to walk through some common questions pandas users have about filtering.
Filtering by substring criteria
Filtering by substring criteria can be useful when you want to select rows based on partial string matches for a subset of the column values. For example, we could do the following if we wanted to filter our
blizzards DataFrame down to just the events with
Snow in the name (index).
Filtering on dates
Filtering on dates is also a very common use case. Let’s say we wanted to filter
blizzards rows down to just the
events in the 20th century (between 1900-1999). Pandas enables us to do this without explicitly casting the date objects into datetime objects.
Filtering using "in" and "not in"
Oftentimes we will want to only select rows with values that are part of (or not part of) a pre-determined set. For instance, we can look at only the major snowfall events that took place in
New York and
We can also use similar techniques to filter rows to exclude events that took place in
New York and
Operator chaining enables us to perform complex filtering logic. This is also when the DataFrames
.query() function becomes especially useful. For instance, let’s filter our
blizzards DataFrame down to only rows that had
greater than 100 cm of snowfall and
occurred after 1900 or
took place in a region that starts with 'New'.
Conclusion & further reading
We should now have a better understanding of how to filter with NumPy arrays, how to apply and build off those techniques to filter pandas rows, how to select pandas columns, and when to use different pandas filtering techniques.
Although we covered a lot in this post, we've only scratched the surface of using pandas to perform data analysis and data engineering.
For more information, please refer to the following links:
Share this post