A Beginner’s Guide to Exploratory Data Analysis with Python
Exploratory Data Analysis (EDA) is how data professionals investigate a dataset and familiarize themselves with its characteristics and the relationships between its variables. EDA makes use of a wide variety of tools and techniques, but chief among these is data visualization. By analyzing and visualizing data through EDA, we can get a true sense of what the data looks like, discover trends and patterns, spot outliers and other anomalies, and work out what kinds of research questions the data can answer.
Pioneered by the American mathematician and statistician John Tukey, the concept of EDA can be traced back to the 1970s, yet it is still widely used today as a crucial step in the data discovery process, performed before more sophisticated data analysis and machine learning.
In this tutorial, we will explore this concept further by conducting EDA on a dataset using Python. Aside from base Python, we will make use of four Python libraries for our EDA project:
NumPy: A core Python library for scientific computing with high-performance arrays.
Pandas: A fast and powerful data analysis and manipulation tool.
Matplotlib: A comprehensive library for creating visualizations in Python.
Seaborn: Another data visualization library built on top of Matplotlib.
We’ll be using a dataset that contains client information for an insurance company. Each of the 10,000 rows in the dataset corresponds to a single client, with 19 variables recording a variety of client-specific information.
Importing the relevant libraries
Before loading the dataset, we will need to first import all the relevant libraries. If you don’t have them installed already, you can do so by using the pip install command.
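If you’re starting from scratch, the setup might look like this (the aliases np, pd, plt, and sns are the usual conventions):

```python
# Install the libraries first if needed:
# pip install numpy pandas matplotlib seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```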
Loading the data
We can now load the dataset into pandas using the read_csv() function. This reads the CSV file into a Pandas DataFrame.
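Something along these lines, where the file name is a placeholder for wherever your copy of the dataset lives:

```python
# Read the CSV file into a DataFrame
df = pd.read_csv("car_insurance.csv")
```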
Viewing the dataframe
We can get a quick sense of the size of our dataset by using the shape attribute. This returns a tuple with the number of rows and columns in the dataset.
Now let's preview the first 5 rows.
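Assuming the DataFrame is named df, as in the loading step:

```python
# (rows, columns) of the dataset
df.shape

# Preview the first 5 rows
df.head()
```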
Using the info() method, we can glean more information on the dataset, including the names of the columns, their corresponding data types, and the number of non-null values in each.
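For example:

```python
# Column names, data types, and non-null counts in one call
df.info()
```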
Preparing the data
While our dataset does not appear to have any serious issues, we will nonetheless have to do some basic cleaning and transformation to get it ready for the main EDA task.
Missing values
We will start by checking the dataset for missing or null values. For this, we can use the isna() method, which returns a dataframe of boolean values indicating whether a field is null or not. To count the missing values in each column, we can chain on the sum() method.
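For example:

```python
# Number of missing values in each column
df.isna().sum()
```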
We now have an idea of what data is missing and where. Typically, we have two options: delete the rows that contain missing data or replace the missing values with something sensible. In our case, deleting that many rows could skew our analysis, so we will replace the values instead.
Several different methods exist for imputing missing values, and what works best usually depends on the characteristics of the dataset in question as well as the objective of the analysis. One of the simplest is to replace the null values in each column with the column mean or mode.
We will begin with the “credit_score” column. Since credit scores are heavily influenced by one’s income situation, it is a better idea to impute the missing values in this column based on the mean credit score for the income group an individual belongs to. We can first use the groupby() method to see how the mean values for each income group differ.
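Something like:

```python
# Mean credit score per income group
df.groupby("income")["credit_score"].mean()
```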
The mean credit scores for each group do differ widely, as we suspected. We can go ahead and impute the missing values in the “credit_score” column using the mean credit score for each income group. The simplest way to do this is to create a function so we don’t have to repeat code for each income group.
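One possible helper, sketched here with illustrative function and variable names:

```python
# Pre-compute the mean credit score for each income group
mean_scores = df.groupby("income")["credit_score"].mean()

def impute_credit_score(row):
    """Fill a missing credit score with the mean for the row's income group."""
    if pd.isna(row["credit_score"]):
        return mean_scores[row["income"]]
    return row["credit_score"]
```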
We can now apply our custom function to the dataframe.
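Continuing the sketch from above:

```python
# Apply the helper row-wise and overwrite the column
df["credit_score"] = df.apply(impute_credit_score, axis=1)

# Verify the column has no missing values left
df["credit_score"].isna().sum()
```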
We no longer have any missing values in the “credit_score” column. We can now tackle the missing values in the “annual_mileage” column. This time, we will group by the “driving_experience” column and compare the mean annual mileage of each group.
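For example:

```python
# Mean annual mileage per driving-experience group
df.groupby("driving_experience")["annual_mileage"].mean()
```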
Unlike the “credit_score” column, the means for the different groups in “driving_experience” do not vary too widely, so we can simply impute the null values using the overall column mean.
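A minimal sketch:

```python
# Fill nulls with the overall column mean
df["annual_mileage"] = df["annual_mileage"].fillna(df["annual_mileage"].mean())

# Confirm no missing values remain anywhere
df.isna().sum()
```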
We no longer have any null values in our dataset.
Dropping columns
Neither the “id” nor the “postal_code” column is relevant for our analysis, so we can get rid of them using the drop() method. We will set the “axis” argument to 1 since we’re dealing with columns, and set the “inplace” argument to True to make the change permanent.
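For example:

```python
# Drop the two columns we won't need
df.drop(["id", "postal_code"], axis=1, inplace=True)
```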
The data preparation section of our project is now complete, and we can shift our attention to our main task: analyzing the data.
Analyzing the data
With our cleaned dataset, we can begin the task of exploring the data. While several different analyses exist for EDA, we can group them under three large umbrellas: univariate analysis, bivariate analysis, and multivariate analysis. We will look at each of these in turn.
Univariate analysis
Univariate analysis is the simplest form of analyzing data. As the name implies, it deals with analyzing data within a single column or variable and is mostly used to describe data. There are different kinds of univariate analyses.
Categorical unordered: This type of data has no order or ranking, and is categorical as opposed to numerical. Our “gender” column contains two sub-categories that describe whether a client is male or female. We can get a count of each category by using the value_counts() method.
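For example:

```python
# Count of clients in each gender category
df["gender"].value_counts()
```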
Better yet, we can visualize this information using a countplot from Seaborn.
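Something like:

```python
# Bar chart of category counts
sns.countplot(x="gender", data=df)
plt.show()
```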
Categorical ordered: This type of data has a natural rank and progression. Examples from our dataset include “education” and “income”. Let’s explore the income variable using a pie chart.
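One way to draw this, a sketch using pandas’ built-in matplotlib wrapper:

```python
# Pie chart of income categories, labeled with percentages
df["income"].value_counts().plot.pie(autopct="%1.0f%%")
plt.ylabel("")  # drop the default axis label
plt.show()
```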
The largest category is “upper class”, representing 43% of the total, followed by “middle class” (21%), “poverty” (18%), and “working class” (17%). Now let’s explore the “education” variable using a countplot.
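Same pattern as before:

```python
sns.countplot(x="education", data=df)
plt.show()
```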
There are more clients with a high school education than any other category, followed by university graduates and then clients with no education.
Numeric: The third type of univariate analysis uses numerical data. Univariate numeric data is usually analyzed with summary statistics such as the mean, mode, maximum, minimum, and standard deviation. One easy way to get these summary statistics for a numerical column is the describe() method. Let’s try this on the “credit_score” column.
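For example:

```python
# Count, mean, std, min, quartiles, and max in one call
df["credit_score"].describe()
```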
This is great information, but it doesn’t tell us how the data is distributed. A histogram is a great way to visualize the frequency distribution of numerical data. We can plot one using the histplot() function in Seaborn.
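Something like:

```python
# Frequency distribution of credit scores
sns.histplot(x="credit_score", data=df)
plt.show()
```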
The “credit_score” column follows a normal distribution or bell curve. Let’s create another histogram for the “annual_mileage” column, but this time we will include a kernel density estimate (KDE) to show the smoothed, continuous shape of the distribution.
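For example:

```python
# Histogram with a smoothed density curve overlaid
sns.histplot(x="annual_mileage", data=df, kde=True)
plt.show()
```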
Another bell curve, confirming that values near the mean occur more frequently than values far from the mean.
Bivariate analysis
Bivariate analysis involves analyzing data with two variables or columns. This is usually a way to explore the relationships between these variables and how they influence each other, if at all. A bivariate analysis could take one of three different forms: numeric-numeric, numeric-categorical and categorical-categorical.
Numeric-Numeric: Scatter plots are a common way to compare two numeric variables. Let’s investigate the relationship between “annual_mileage” and “speeding_violations”.
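For example:

```python
# Scatter plot of two numeric variables
sns.scatterplot(x="annual_mileage", y="speeding_violations", data=df)
plt.show()
```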
From the graph, we can infer a negative correlation between annual mileage and the number of speeding violations. This means the more miles a client drives per year, the fewer speeding violations they commit.
We could also use a correlation matrix to get more specific information about the relationship between these two variables. A correlation matrix is useful for identifying the relationships between several variables at once. As an example, let’s create a matrix using the “speeding_violations”, “DUIs”, and “past_accidents” columns.
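A sketch; the DUIs column is assumed to be stored in lowercase as “duis”, so match the name to your copy of the data:

```python
# Pairwise correlations between the three columns
df[["speeding_violations", "duis", "past_accidents"]].corr()
```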
All our variables exhibit a positive correlation with each other, meaning when one goes up, the other tends to go up as well, and vice versa. But how do we interpret the strength of this relationship? Generally speaking, a correlation coefficient with a magnitude between 0.5 and 0.7 indicates variables that can be considered moderately correlated, while a magnitude between 0.3 and 0.5 indicates weak correlation, as is the case with most of our variables. This means a moderate, positive correlation exists between the number of past accidents and speeding violations, while a weak, positive correlation exists between the number of past accidents and DUIs.
The best way to visualize correlation, however, is with a heatmap. We can easily create one by passing the correlation matrix into the heatmap() function in Seaborn.
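For example:

```python
corr_matrix = df[["speeding_violations", "duis", "past_accidents"]].corr()

# annot=True prints the coefficient inside each cell
sns.heatmap(corr_matrix, annot=True)
plt.show()
```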
Numeric-Categorical: Here, we analyze data using one numeric variable and one categorical variable, typically by comparing summary statistics such as the mean and median across categories, as in the example below. We first group by “outcome” and then calculate the mean “annual_mileage” for each group.
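For example:

```python
# Mean annual mileage for each outcome group
df.groupby("outcome")["annual_mileage"].mean()
```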
Using this method, we could return the minimum, maximum, or median annual mileage for each category by using the min(), max(), and median() methods respectively. However, we can better visualize the difference in dispersion or variability between the two groups by using box plots. Box plots display a five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum.
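Something like:

```python
# One box per outcome category
sns.boxplot(x="outcome", y="annual_mileage", data=df)
plt.show()
```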
Both groups have similar medians (denoted by the middle line that runs through each box), though clients who made a claim have a slightly higher median annual mileage than clients who didn’t. The same can be said for the first and third quartiles (denoted by the lower and upper borders of the box respectively).
Similarly, we can compare the distributions of the two categories in “outcome” based on their credit scores, but this time we’ll make use of a bivariate histogram by setting the “hue” argument in the histplot() function to “outcome”.
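For example:

```python
# Credit-score distributions split (and colored) by outcome
sns.histplot(x="credit_score", hue="outcome", data=df)
plt.show()
```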
Categorical-Categorical: As you may have guessed by now, this involves a set of two categorical variables. As an example, we will explore how the “outcome” variable relates to categories like age and vehicle year. To begin, we will convert the labels in the outcome column from True and False to 1s and 0s respectively. This will allow us to calculate the claim rate for any group of clients.
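A minimal sketch, assuming the labels are stored as booleans (if they are strings, use map() instead):

```python
# Convert boolean outcome labels to integers (True -> 1, False -> 0)
df["outcome"] = df["outcome"].astype(int)

# Share of clients in each outcome group
df["outcome"].value_counts(normalize=True)
```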
About half as many clients made a claim in the past year as didn’t. Now let’s check how the claim rate is distributed across the different age categories.
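Because the outcome is now 0 or 1, the group mean is the claim rate:

```python
# Claim rate per age group
df.groupby("age")["outcome"].mean()
```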
From the above, it is clear that younger people are more likely to make an insurance claim. We can do the same for “vehicle_year”.
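Same idea:

```python
# Claim rate per vehicle-year category
df.groupby("vehicle_year")["outcome"].mean()
```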
Clients with older vehicles are much more likely to file a claim. Another way to visualize the claim rate is by using probability bar charts.
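One way to draw these, a sketch using Seaborn’s barplot(), which with a 0/1 outcome plots the claim probability for each category:

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.barplot(x="education", y="outcome", data=df, ax=axes[0])
sns.barplot(x="income", y="outcome", data=df, ax=axes[1])
plt.show()
```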
Clients with no education are more likely to file a claim than high school and university graduates, while clients in the “poverty” income group are the most likely to file a claim, followed by clients in the “working class” and “middle class” categories, in that order.
Multivariate analysis
This comprises data analysis involving more than two variables. A common tool for multivariate analysis is the heatmap, which provides a fast and simple way to visually recognize patterns and trends. We can easily check the relationship between variables in our dataset like “education” and “income” by using a third variable, the claim rate. First, we will create a pivot table.
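Since the outcome column is 0/1, pivot_table()'s default mean aggregation gives the claim rate directly:

```python
# Claim rate for every education-income combination
pivot = df.pivot_table(index="education", columns="income", values="outcome")
```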
We can then pass in our pivot table to the heatmap() function in Seaborn.
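For example:

```python
sns.heatmap(pivot, annot=True)
plt.show()
```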
High school graduates in the poverty income class have the highest claim rate, followed by university graduates in the poverty income class. Clients in the upper class income category with no education have the lowest claim rates.
Let’s do the same for driving experience and marital status.
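A sketch; the marital-status column is assumed to be named “married”, so adjust to your dataset:

```python
pivot = df.pivot_table(index="driving_experience", columns="married", values="outcome")
sns.heatmap(pivot, annot=True)
plt.show()
```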
Unmarried individuals with 0–9 years of driving experience are the most likely to file a claim, while married individuals with 30+ years of driving experience are the least likely to file a claim.
Finally, let’s create a heatmap to visualize gender, family status, and claim rate.
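Another sketch, assuming the family-status column is named “children”:

```python
pivot = df.pivot_table(index="gender", columns="children", values="outcome")
sns.heatmap(pivot, annot=True)
plt.show()
```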
Men without children are the most likely to make a claim while women with children are the least likely to make a claim.
Conclusion
In this tutorial, we have explored the basics of EDA by conducting univariate, bivariate, and multivariate analyses on a dataset. I hope that I was able to clearly illustrate the kinds of issues to tackle, the types of visualizations to create, and the various analyses to do while exploring a dataset. Most important of all, I hope that you have gained some new skills from reading or following along with this article. Thank you very much for sticking around to the end, and if there is anything you need further clarification on, please don’t hesitate to leave a comment. All the best in your data journey!