Tutorial: cleaning and tidying data in pandas

Don't let sloppy data get you down. Learn how to clean and tidy up messy data with pandas DataFrames.

This blog post is based on a presentation from Daniel Chen at PyData DC 2018. For a more step-by-step explanation, please refer to Chen's original talk here.

Dealing with messy data is a hassle. Unfortunately, we live in a world where data isn’t always in the best format and we need to tidy it ourselves.

What makes data tidy? According to data expert Hadley Wickum (and his contribution to the Journal of Statistical Software), it has the following qualities:

⁠Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

This loosely translates to normalized data that is structured in such a way that each row corresponds to a set of model inputs and outputs. On a less technical level, it means that the data is in a database-friendly format that is easy to work with for analytics and predictive modeling.

Meanwhile, messy data looks like this:

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.

Luckily, it doesn't have to stay that way. Let's look at how you can use pandas DataFrames to clean and tidy up your messy data (and check out the accompanying notebook here).

High-level data checks

To start, let’s make sure we understand our data at a high level. For this exercise, we'll use the billboard.csv dataset. We start by loading the CSV file into a pandas DataFrame.

import pandas as pd
billboard = pd.read_csv('/work/billboard.csv')

Next, we can check the column names, first rows, and indices of the DataFrame using the head() method.

billboard.head()

From this snapshot, it appears that the column names are accurate. But some of the column headers are actual values (e.g., wk1, wk2, etc.), not variables, so we're dealing with messy data. Before diving into cleaning and tidying the data, let’s assess the high-level attributes of the data.

billboard.info()

The pandas info() method enables us to visualize the number of rows (RangeIndex), number of columns, column names, types, and counts of non-null entries.

Based on the shape of the data (317 rows and 80 columns), it appears we're dealing with a wide data set. Wide data has more columns, whereas long data has more rows. Generally speaking, it's better to have long data because it's easier to work with and more database-friendly.

Selecting certain rows & columns

We'll be filtering rows and columns throughout our working examples. If you're new to pandas and need a refresher on selecting certain rows and columns, we suggest you refer to our tutorial on filtering with pandas.

Working examples

Column headers are values, not variable names

We observed in a previous section that the billboard DataFrame is a messy, wide data set. This is because it contains many different wk columns. We can tidy our data by converting the various wk columns into values in a single column. We'll use the pandas melt() function to do so.

billboard_melt = billboard.melt(id_vars=['year','artist','track','time','date.entered'], var_name='week', value_name='count')
billboard_melt.info()

billboard_melt.head()

As you can see, the resulting DataFrame has only seven columns (versus the original 81), but 24,092 rows (versus the original 317). The melt method transformed our data set from wide to long by combining all the columns with the wk prefix into a single column called week.

Let’s unpack what’s happening in the melt method. For further details, we recommend referring to the pandas documentation on melt here.

billboard_melt = billboard.melt(
	id_vars=['year','artist','track','time','date.entered'], 
	var_name='week', 
	value_name='count'
)

melt inputs:

id_vars are the unchanged columns whose column names won't be converted into column values. By contrast, you can also specify the column names you want to explicitly convert into column values via the value_vars parameter.
var_name is the column name of the former column names (now values).
value_name is the column name of the values of the former columns.

Multiple variables stored in one column

Let’s switch gears and examine the country_timeseries.csv data set.

country_timeseries = pd.read_csv('/work/country_timeseries.csv')
country_timeseries.info()

We see with this example that columns are values, not variables. Therefore, we can use the previous melt method to transform our data set to tidy it up.

country_timeseries_melt = country_timeseries.melt(id_vars=['Date','Day'], var_name='country_condition', value_name='count')
country_timeseries_melt.info()

country_timeseries_melt.head()

Although we solved our wide data issue (going from 122 rows and 18 columns to 1,952 rows and 4 columns), it appears the country_condition column contains multiple variables within it: both country (e.g., Guinea) and the conditions (cases versus deaths). To tidy this data set, we should split the values in the country_condition column as follows:

country_timeseries_melt['Country'] = country_timeseries_melt['country_condition'].str.split('_').str[0]
country_timeseries_melt['Condition'] = country_timeseries_melt['country_condition'].str.split('_').str[1]
country_timeseries_melt.info()

Let’s unpack what’s happening in the following code block:

country_timeseries_melt['country_condition'].str.split('_').str[0]

country_timeseries_melt['country_condition'] returns a pandas Series with values from the country_condition columns.
.str is a string accessor method that lets you perform string operations on Series.
.split takes a value like Cases_Guinea and separates it into two parts: Cases and Guinea.
.str[0] takes the result of the split method, which is a Series of arrays, creates a string accessor, and indexes into the first element of the array in each Series row (the country in this case).

Variables are stored in both columns and rows

When variables are stored in both columns and rows, we should handle each piece independently. This issue often manifests as data sets containing a lot of repeated values in their rows. For this working example, we'll examine the weather.csv data set.

weather = pd.read_csv('/work/weather.csv')
weather.info()

weather.head()

The first thing that should jump out is the columns that represent values (i.e., columns with the d prefix). We can tidy these columns by applying the melt method as follows:

weather_melt = weather.melt(id_vars=['id','year','month','element'], var_name='day', value_name='temp')
weather_melt.info()

weather_melt.head()

Although this is progress, the resulting data set contains a column named element with what appears to be two columns: tmin and tmax. This is because each ID, year, month, and day have a minimum and maximum temperature associated with them. Both minimum and maximum change subject to those other variables and it could be useful to display them all on one row to more cleanly predict either variable.

weather_pivot = weather_melt.pivot_table(index=['id','year','month','day'], columns='element',values='temp').reset_index()
weather_pivot.info()

weather_pivot.head()

We used the pivot method to convert the values in the element column into column headers. Let’s unpack what’s happening in the following code block:

weather_melt.pivot_table(
	index=['id','year','month','day'], 
	columns='element',
	values='temp'
).reset_index()

.pivot_table() reformats data while gracefully handling duplicate values.
index specifies the values to keep as indexes (i.e., keep as row values once the DataFrame indexes are reset).
columns specifies which column values to convert into column headers.
values specifies which values to associate with the new column entries.
.reset_index() converts the indexes specified by the index parameter back to DataFrame row values.

The net result is a table where entries in each row correspond to either an explanatory or response variable, which can easily be used for modeling and analytical purposes.

Normalizing data sets

This section could also be titled: handling cases when multiple types of observational units are stored in the same table and/or a single observational unit is stored in multiple tables.

It's often the case that we'll want to clean up data sets that have repeated column values and should instead be split apart and combined via join keys. For this working example, we can return to our billboard.csv data set.

billboard = pd.read_csv('/work/billboard.csv')

This data set contains the weekly ratings for different songs on the Billboard charts. We can use the melt() method to convert the weekly ratings columns into one column with all the ratings contained in it. The problem with this, however, is that it means each row has a lot of duplicate data related to the song year, artist, track, and time.

Instead, what we should do is normalize the data set by separating out the attributes that don’t depend on week (year, artist, track, and time) into a billboard_songs table from the attributes that depend on week (date entered and weekly rating) into a billboard_ratings table, then join these two tables on a common join key.

Let’s start by creating the billboard_songs table with just the song-level information that doesn't change with respect to weekly ratings:

billboard_songs = billboard[['year','artist','track','time']].drop_duplicates()
billboard_songs.head()

To create this new billboard_songs table, we selected just the columns we were interested in, then dropped duplicate rows (i.e., all rows with the exact same values as another row). Next, we can create a join key called id to attach these values back onto the Billboard weekly ratings data. If you're not familiar with joins, we recommend you read the pandas docs about it here.

billboard_songs['id'] = range(len(billboard_songs))
billboard_songs.head()

This code block assigns a unique ID for each row that corresponds, which happens to correspond with the row’s index value.

Now that we’ve broken out song-level attributes into their own table, we can break out ratings-level attributes into a separate billboard_ratings table. To do this, we need to assign a join key (song_id) to each row, drop the song-level attributes from the table, and then use melt to convert the column values into one column variable:

billboard['song_id'] = range(len(billboard))
billboard_drop = billboard.drop(['year','artist','track','time'], axis=1)
billboard_ratings = billboard_drop.melt(id_vars=['song_id','date.entered'], var_name='week', value_name='rating')
billboard_ratings.head()

The result is two normalized tables (billboard_songs and billboard_ratings) that can be easily joined together to recreate the full, human-readable picture of weekly ratings by song (and associated attributes). Even better, neither contains redundant information that takes up unnecessary space, memory, or computing resources.

We hope this tutorial gave you a better understanding of what messy data is and how to clean it up. Best of luck navigating the world of messy data!

Start cleaning and tidying messy data in Deepnote

Get started for free and start exploring your data in seconds.

The notebook manifesto

Data analytics

Data engineering

Machine learning

Fintech & Finance

Biotechnology

Gaming

Enterprise

Startups

Research

Use cases

Tutorial: cleaning & tidying data in pandas

High-level data checks

Selecting certain rows & columns

Working examples

Column headers are values, not variable names

Multiple variables stored in one column

Variables are stored in both columns and rows

Normalizing data sets

Start cleaning and tidying messy data in Deepnote

Blog

Ultimate guide to huggingface_hub library in Python

Ultimate guide to torchvision library in Python

How we made data apps 40% faster

That’s it, time to try Deepnote