Tutorial: cleaning & tidying data in pandas
This blog post is based on a presentation from Daniel Chen at PyData DC 2018. For a more step-by-step explanation, please refer to Chen's original talk here.
Dealing with messy data is a hassle. Unfortunately, we live in a world where data isn’t always in the best format and we need to tidy it ourselves.
What makes data tidy? According to data expert Hadley Wickum (and his contribution to the Journal of Statistical Software), it has the following qualities:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This loosely translates to normalized data that is structured in such a way that each row corresponds to a set of model inputs and outputs. On a less technical level, it means that the data is in a database-friendly format that is easy to work with for analytics and predictive modeling.
Meanwhile, messy data looks like this:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Luckily, it doesn't have to stay that way. Let's look at how you can use pandas DataFrames to clean and tidy up your messy data (and check out the accompanying notebook here).
High-level data checks
To start, let’s make sure we understand our data at a high level. For this exercise, we'll use the billboard.csv dataset. We start by loading the CSV file into a pandas DataFrame.
import pandas as pd billboard = pd.read_csv('/work/billboard.csv')
Next, we can check the column names, first rows, and indices of the DataFrame using the
From this snapshot, it appears that the column names are accurate. But some of the column headers are actual values (e.g.,
wk2, etc.), not variables, so we're dealing with messy data. Before diving into cleaning and tidying the data, let’s assess the high-level attributes of the data.
info() method enables us to visualize the number of rows (RangeIndex), number of columns, and column names, types, and counts of non-null entries.
Based on the shape of the data (317 rows and 80 columns), it appears we're dealing with a wide dataset. Wide data has more columns, whereas long data has more rows. Generally speaking, it's better to work with long because it's easier to work with and more database-friendly.
Selecting certain rows & columns
We'll be filtering rows and columns throughout our working examples. If you're new to pandas and need a refresher on selecting certain rows and columns, we suggest you refer to our tutorial on filtering with pandas.
Column headers are values, not variable names
We observed in a previous section that the
billboard DataFrame is a messy, wide dataset. This is because it contains many different
wk columns. We can tidy our data by converting the various
wk columns into values in a single column. We'll use the Pandas
melt() function to do so.
billboard_melt = billboard.melt(id_vars=['year','artist','track','time','date.entered'], var_name='week', value_name='count') billboard_melt.info()
As you can see, the resulting DataFrame has only seven columns (versus the original 81), but 24,092 rows (versus the original 317). The
melt method transformed our dataset from wide to long by combining all the columns with the
wk prefix into a single column called
Let’s unpack what’s happening in the
melt method. For further details, we recommend referring to the pandas documentation on melt here.
billboard_melt = billboard.melt( id_vars=['year','artist','track','time','date.entered'], var_name='week', value_name='count' )
id_varsis the unchanged columns whose column names won't be converted into column values. By contrast, you can also specify the column names you want to explicitly convert into column values via the
var_nameis the column name of the former column names (now values).
value_nameis the column name of the values of the former columns.
Multiple variables stored in one column
Let’s switch gears and examine the country_timeseries.csv dataset.
country_timeseries = pd.read_csv('/work/country_timeseries.csv') country_timeseries.info()
We see with this example that columns are values, not variables. Therefore, we can use the previous
melt method to transform our dataset to tidy it up.
country_timeseries_melt = country_timeseries.melt(id_vars=['Date','Day'], var_name='country_condition', value_name='count') country_timeseries_melt.info()
Although we solved our wide data issue (going from 122 rows and 18 columns to 1,952 rows and 4 columns), it appears the
country_condition column contains multiple variables within it: both country (e.g., Guinea) and the conditions (cases versus deaths). To tidy this dataset, we should split the values in the
country_condition column as follows:
country_timeseries_melt['Country'] = country_timeseries_melt['country_condition'].str.split('_').str country_timeseries_melt['Condition'] = country_timeseries_melt['country_condition'].str.split('_').str country_timeseries_melt.info()
Let’s unpack what’s happening in the following code block:
country_timeseries_melt['country_condition']returns a pandas Series with values from the
.stris a string accessor method that lets you perform string operations on Series.
.splittakes a value like
Cases_Guineaand separates it into two parts:
.strtakes the result of the
splitmethod, which is a Series of arrays, creates a string accessor, and indexes into the first element of the array in each Series row (the country in this case).
Variables are stored in both columns and rows
When variables are stored in both columns and rows, we should handle each piece independently. This issue oftentimes manifests as data sets containing a lot of repeated values in their rows. For this working example, we'll examine the weather.csv dataset.
weather = pd.read_csv('/work/weather.csv') weather.info()
The first thing that should jump out is the columns that represent values (i.e., columns with the
d prefix). We can tidy these columns by applying the
melt method as follows:
weather_melt = weather.melt(id_vars=['id','year','month','element'], var_name='day', value_name='temp') weather_melt.info()
Although this is progress, the resulting dataset contains a column named
element with what appears to be two columns:
tmax. This is because each id, year, month, and day have a minimum and maximum temperature associated with them. Both minimum and maximum change subject to those other variables and it could be useful to display them all on one row to more cleanly predict either variable.
weather_pivot = weather_melt.pivot_table(index=['id','year','month','day'], columns='element',values='temp').reset_index() weather_pivot.info()
We used the
pivot method to convert the values in the
element column into column headers. Let’s unpack what’s happening in the following code block:
weather_melt.pivot_table( index=['id','year','month','day'], columns='element', values='temp' ).reset_index()
.pivot_table()reformats data while gracefully handling duplicate values.
indexspecifies the values to keep as indexes (i.e., keep as row values once the DataFrame indexes are reset).
columnsspecifies which column values to convert into column headers.
valuesspecifies which values to associate with the new column entries.
.reset_index()converts the indexes specified by the
indexparameter back to DataFrame row values.
The net result is a table where entries in each row correspond to either an explanatory or response variable, which can easily be used for modeling and analytical purposes.
Normalizing data sets
This section could also be titled: handling cases when multiple types of observational units are stored in the same table and/or a single observational unit is stored in multiple tables.
It's often the case that we'll want to clean up data sets that have repeated column values and should instead be split apart and combined via join keys. For this working example, we can return to our billboard.csv dataset.
billboard = pd.read_csv('/work/billboard.csv')
This dataset contains the weekly ratings for different songs on the Billboard charts. We can use the
melt() method to convert the weekly ratings columns into one column with all the ratings contained in it. The problem with this, however, is that it means each row has a lot of duplicate data related to the song year, artist, track, and time.
Instead, what we should do is normalize the dataset by separating out the attributes that don’t depend on week (year, artist, track, and time) into a
billboard_songs table from the attributes that depend on week (date entered and weekly rating) into a
billboard_ratings table, then join these two tables on a common join key.
Let’s start by creating the
billboard_songs table with just the song-level information that doesn't change with respect to weekly ratings:
billboard_songs = billboard[['year','artist','track','time']].drop_duplicates() billboard_songs.head()
To create this new
billboard_songs table, we selected just the columns we were interested in, then dropped duplicate rows (i.e., all rows with the exact same values as another row). Next, we can create a join key called
id to attach these values back onto the Billboard weekly ratings data. If you're not familiar with joins, we recommend you read the pandas docs about it here.
billboard_songs['id'] = range(len(billboard_songs)) billboard_songs.head()
This code block assigns a unique id for each row that corresponds, which happens to correspond with the row’s index value.
Now that we’ve broken out song-level attributes into their own table, we can break out ratings-level attributes into a separate
billboard_ratings table. To do this, we need to assign a join key (
song_id) to each row, drop the song-level attributes from the table, and then use
melt to convert the column values into one column variable:
billboard['song_id'] = range(len(billboard)) billboard_drop = billboard.drop(['year','artist','track','time'], axis=1) billboard_ratings = billboard_drop.melt(id_vars=['song_id','date.entered'], var_name='week', value_name='rating') billboard_ratings.head()
The result is two normalized tables (
billboard_ratings) that can be easily joined together to recreate the full, human-readable picture of weekly ratings by song (and associated attributes). Even better, neither contains redundant information that takes up unnecessary space, memory, or computing resources.
We hope this tutorial gave you a better understanding of what messy data is and how to clean it up. Best of luck navigating the world of messy data!
Start cleaning and tidying messy data in Deepnote
Get started for free and start exploring your data in seconds.
Share this post
Join the world's best data teams and get started with Deepnote
No credit card required. Run your first notebook in seconds.