This blog post is based on a presentation from Daniel Chen at PyData DC 2018. For a more step-by-step explanation, please refer to Chen's original talk here.
Dealing with messy data is a hassle. Unfortunately, we live in a world where data isn’t always in the best format and we need to tidy it ourselves.
What makes data tidy? According to data expert Hadley Wickham (in his tidy data paper in the Journal of Statistical Software), it has the following qualities:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This loosely translates to normalized data that is structured in such a way that each row corresponds to a set of model inputs and outputs. On a less technical level, it means that the data is in a database-friendly format that is easy to work with for analytics and predictive modeling.
Meanwhile, messy data looks like this:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Luckily, it doesn't have to stay that way. Let's look at how you can use pandas DataFrames to clean and tidy up your messy data (and check out the accompanying notebook here).
High-level data checks
To start, let’s make sure we understand our data at a high level. For this exercise, we'll use the billboard.csv dataset. We start by loading the CSV file into a pandas DataFrame.
import pandas as pd
billboard = pd.read_csv('/work/billboard.csv')
Next, we can check the column names, first rows, and indices of the DataFrame using the head() method.
billboard.head()
From this snapshot, it appears that the column names are accurate. But some of the column headers are actual values (e.g., wk1, wk2, etc.), not variables, so we're dealing with messy data. Before diving into cleaning and tidying the data, let’s assess the high-level attributes of the data.
billboard.info()
The pandas info() method lets us inspect the number of rows (RangeIndex), the number of columns, the column names, their types, and the counts of non-null entries.
Based on the shape of the data (317 rows and 81 columns), it appears we're dealing with a wide data set. Wide data has more columns, whereas long data has more rows. Generally speaking, it's better to have long data because it's easier to work with and more database-friendly.
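If you only need the row and column counts, the DataFrame's shape attribute is a quick way to confirm this (a minimal sketch using the billboard DataFrame loaded above):
# (number of rows, number of columns)
billboard.shape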
Selecting certain rows & columns
We'll be filtering rows and columns throughout our working examples. If you're new to pandas and need a refresher on selecting certain rows and columns, we suggest you refer to our tutorial on filtering with pandas.
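If it helps, here is a minimal refresher sketch using the billboard DataFrame loaded above (the column names come from that data set):
# Select a single column as a Series
artists = billboard['artist']
# Select several columns as a DataFrame
subset = billboard[['artist', 'track', 'time']]
# Filter rows with a boolean condition
year_2000 = billboard[billboard['year'] == 2000]
# Label-based selection of rows and columns with .loc
first_tracks = billboard.loc[0:4, ['artist', 'track']]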
Working examples
Column headers are values, not variable names
We observed in a previous section that the billboard DataFrame is a messy, wide data set. This is because it contains many different wk columns. We can tidy our data by converting the various wk columns into values in a single column. We'll use the pandas melt() function to do so.
billboard_melt = billboard.melt(id_vars=['year','artist','track','time','date.entered'], var_name='week', value_name='count')
billboard_melt.info()
billboard_melt.head()
As you can see, the resulting DataFrame has only seven columns (versus the original 81), but 24,092 rows (versus the original 317). The melt method transformed our data set from wide to long by combining all the columns with the wk prefix into a single column called week.
Let’s unpack what’s happening in the melt method. For further details, we recommend referring to the pandas documentation on melt here.
billboard_melt = billboard.melt(
id_vars=['year','artist','track','time','date.entered'],
var_name='week',
value_name='count'
)
melt inputs:
- id_vars are the unchanged columns whose column names won't be converted into column values. By contrast, you can also explicitly specify the columns you do want converted into column values via the value_vars parameter (see the sketch after this list).
- var_name is the name of the new column that holds the former column names (now values).
- value_name is the name of the new column that holds the values of the former columns.
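For instance, if you only wanted to melt a handful of the week columns, you could name them explicitly with value_vars (a minimal sketch; wk1 through wk3 are example column names from the data set above):
# Melt only the explicitly listed week columns
billboard_first_weeks = billboard.melt(
    id_vars=['year','artist','track','time','date.entered'],
    value_vars=['wk1','wk2','wk3'],
    var_name='week',
    value_name='count'
)
billboard_first_weeks.head()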
Multiple variables stored in one column
Let’s switch gears and examine the country_timeseries.csv data set.
country_timeseries = pd.read_csv('/work/country_timeseries.csv')
country_timeseries.info()
We see with this example that columns are values, not variables. Therefore, we can use the same melt method as before to transform and tidy up the data set.
country_timeseries_melt = country_timeseries.melt(id_vars=['Date','Day'], var_name='country_condition', value_name='count')
country_timeseries_melt.info()
country_timeseries_melt.head()
Although we solved our wide data issue (going from 122 rows and 18 columns to 1,952 rows and 4 columns), it appears the country_condition column contains multiple variables within it: both the country (e.g., Guinea) and the condition (cases versus deaths). To tidy this data set, we should split the values in the country_condition column as follows:
country_timeseries_melt['Condition'] = country_timeseries_melt['country_condition'].str.split('_').str[0]
country_timeseries_melt['Country'] = country_timeseries_melt['country_condition'].str.split('_').str[1]
country_timeseries_melt.info()
Let’s unpack what’s happening in the following code block:
country_timeseries_melt['country_condition'].str.split('_').str[0]
- country_timeseries_melt['country_condition'] returns a pandas Series with the values from the country_condition column.
- .str is a string accessor that lets you perform string operations on a Series.
- .split('_') takes a value like Cases_Guinea and separates it into two parts: Cases and Guinea (a more compact alternative is sketched after this list).
- .str[0] takes the result of the split method, which is a Series of lists, creates another string accessor, and indexes into the first element of the list in each row (the condition, Cases, in this case).
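As an aside, the same split can be done in one pass by passing expand=True, which returns a DataFrame of parts instead of a Series of lists (a minimal sketch of an alternative to the two lines above):
# Split once, then assign both resulting columns
parts = country_timeseries_melt['country_condition'].str.split('_', expand=True)
country_timeseries_melt['Condition'] = parts[0]
country_timeseries_melt['Country'] = parts[1]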
Variables are stored in both columns and rows
When variables are stored in both columns and rows, we should handle each piece independently. This issue often manifests as data sets containing a lot of repeated values in their rows. For this working example, we'll examine the weather.csv data set.
weather = pd.read_csv('/work/weather.csv')
weather.info()
weather.head()
The first thing that should jump out is the columns that represent values (i.e., columns with the d prefix). We can tidy these columns by applying the melt method as follows:
weather_melt = weather.melt(id_vars=['id','year','month','element'], var_name='day', value_name='temp')
weather_melt.info()
weather_melt.head()
Although this is progress, the resulting data set contains a column named element whose values (tmin and tmax) are really two separate variables. This is because each combination of ID, year, month, and day has both a minimum and a maximum temperature associated with it. Both values vary with those other variables, and displaying them on the same row makes it easier to model either one cleanly.
weather_pivot = weather_melt.pivot_table(index=['id','year','month','day'], columns='element',values='temp').reset_index()
weather_pivot.info()
weather_pivot.head()
We used the pivot_table method to convert the values in the element column into column headers. Let’s unpack what’s happening in the following code block:
weather_melt.pivot_table(
index=['id','year','month','day'],
columns='element',
values='temp'
).reset_index()
- .pivot_table() reformats data while gracefully handling duplicate values (see the note after this list).
- index specifies the columns to keep as the index (i.e., to keep as row values once the DataFrame index is reset).
- columns specifies which column's values to convert into column headers.
- values specifies which values to associate with the new column entries.
- .reset_index() converts the index specified by the index parameter back into DataFrame row values.
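One design note: when several rows share the same index and column combination, pivot_table aggregates them rather than raising an error, using the mean by default. You can make that explicit (or swap in another function) via the aggfunc parameter; a minimal sketch, equivalent to the call above:
# Same pivot as before, with the default aggregation spelled out;
# duplicate (id, year, month, day, element) combinations would be averaged
weather_pivot = weather_melt.pivot_table(
    index=['id','year','month','day'],
    columns='element',
    values='temp',
    aggfunc='mean'
).reset_index()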
The net result is a table where entries in each row correspond to either an explanatory or response variable, which can easily be used for modeling and analytical purposes.
Normalizing data sets
This section could also be titled: handling cases when multiple types of observational units are stored in the same table and/or a single observational unit is stored in multiple tables.
It's often the case that we'll want to clean up data sets that have repeated column values and should instead be split apart and combined via join keys. For this working example, we can return to our billboard.csv data set.
billboard = pd.read_csv('/work/billboard.csv')
This data set contains the weekly ratings for different songs on the Billboard charts. We could use the melt() method to convert the weekly ratings columns into a single column with all the ratings in it. The problem with this, however, is that each row would then carry a lot of duplicate data about the song's year, artist, track, and time.
Instead, what we should do is normalize the data set: separate the attributes that don’t depend on week (year, artist, track, and time) into a billboard_songs table and the attributes that do depend on week (date entered and weekly rating) into a billboard_ratings table, then join these two tables on a common join key.
Let’s start by creating the billboard_songs table with just the song-level information that doesn't change with respect to weekly ratings:
billboard_songs = billboard[['year','artist','track','time']].drop_duplicates()
billboard_songs.head()
To create this new billboard_songs table, we selected just the columns we were interested in, then dropped duplicate rows (i.e., all rows with the exact same values as another row). Next, we can create a join key called id to attach these values back onto the Billboard weekly ratings data. If you're not familiar with joins, we recommend you read the pandas docs about them here.
billboard_songs['id'] = range(len(billboard_songs))
billboard_songs.head()
This code block assigns a unique ID to each row, which happens to correspond to the row’s index value.
Now that we’ve broken out song-level attributes into their own table, we can break out ratings-level attributes into a separate billboard_ratings table. To do this, we need to assign a join key (song_id) to each row, drop the song-level attributes from the table, and then use melt to convert the week columns into a single column variable:
billboard['song_id'] = range(len(billboard))
billboard_drop = billboard.drop(['year','artist','track','time'], axis=1)
billboard_ratings = billboard_drop.melt(id_vars=['song_id','date.entered'], var_name='week', value_name='rating')
billboard_ratings.head()
The result is two normalized tables (billboard_songs and billboard_ratings) that can be easily joined together to recreate the full, human-readable picture of weekly ratings by song (and associated attributes). Even better, neither contains redundant information that takes up unnecessary space, memory, or computing resources.
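For example, recreating the full view is a single merge on the two keys (a minimal sketch, assuming the id and song_id columns created above line up as intended):
# Attach the song-level attributes back onto the weekly ratings
billboard_full = billboard_ratings.merge(
    billboard_songs,
    left_on='song_id',
    right_on='id',
    how='left'
)
billboard_full.head()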
We hope this tutorial gave you a better understanding of what messy data is and how to clean it up. Best of luck navigating the world of messy data!
Start cleaning and tidying messy data in Deepnote
Get started for free and start exploring your data in seconds.