Cleaning and Analyzing Employee Exit Surveys
In this project, we'll be cleaning and analyzing exit surveys to learn how different factors affect employee resignations.
Approach
Specifically, we'll explore how the following affect the proportion of resignations due to dissatisfaction:
- Length of service at the workplace.
- Employment status at time of resignation.
- Employee gender.
Results
The following groups are more likely to resign due to dissatisfaction:
- Employees who spend longer at a company.
- Permanent employees, in contrast to Temporary or Contract employees.
- Male employees, though there is a marginal difference compared to Female employees.
For more details, refer to the full analysis below.
Data Overview
We'll be working 2014 exit surveys from the Department of Education, Training and Employment (DETE), and 2013 exit surveys from the Technical and Further Education (TAFE) Institute. Because a data dictionary wasn't provided, we'll use our general knowledge to define the columns.
dete_survey.csv
has a total of 56 columns and 822 rows. Below is a preview of some of the more important columns:
ID
: An id used to identify the participant of the surveySeparationType
: The reason why the person's employment endedCease Date
: The year or month the person's employment endedDETE Start Date
: The year the person began employment with the DETE
tafe_survey.csv
has a total of 72 columns and 702 rows. Below is a preview of some of the more important columns:
Record ID
: An id used to identify the participant of the surveyReason for ceasing employment
: The reason why the person's employment endedLengthofServiceOverall. Overall Length of Service at Institute (in years)
: The length of the person's employment in years
DETE overview
From the table above, we can make note of the following:
Business Unit
,Aboriginal
,Torres Strait
,South Sea
,Disability
andNESB
are missing at least ~85% of data.Classification
is missing roughly ~40% of data.- Other columns with missing data are missing between ~1%-10%.
- Roughly ~40% of the columns are
bool
types, while most of the remainder areobject
types.ID
is the onlyint
. Cease Date
,DETE Start Date
andRole Start Date
are represented asobject
instead ofdate
types.- There's many columns here that we won't need for our analysis.
While most of the summary statistic rows don't apply due to the absence of int
types, we can make note of the following:
- The most frequent
DETE Start Date
andRole Start Date
values are Not Stated. This seems to be an alias fornull
. Professional Development
up toHealth & Safety
have 6 unique values: A, SA, N, D, SD, M.- These seem to be aliases for Agree, Strongly Agree, Neutral, Disagree, Strongly Disagree. "M" might be some sort of alias for Not Applicable.
- The most frequent values in all these columns are A or Agree.
Age
column seems to be broken into brackets, with 61 or older the most frequent value.Aboriginal, Torres Strait
,South Sea
,Disability
andNESB
columns only have 1 unique value and are otherwise mostlynull
.- The unique value is Yes.
- This suggests that this field was mostly optional for those doing the survey. Where the value is
null
, we can probably translate that to "No"
TAFE overview
From the table above, we can make note of the following:
Contributing Factors
columns are missing 40% of data.- Many other columns are missing between ~10%-25% of data.
- All of the columns, except for
Record ID
andCESSATION YEAR
, areobject
types. - There's a number of columns like
Gender
,CurrentAge
andEmploymentType
which are shared with the DETE dataset. - There's many columns here that we won't need for our analysis.
We can observe the following:
- The most frequent
Contributing Factors
values are "-". Institute Views
andWorkUnitViews
have 6 unique values:- Agree, Neutral, Strongly Agree, Disagree, Strongly Disagree, Not Applicable.
- The most common value in these columns tend to be Agree
CurrentAge
is also split into age brackets, with the most frequency bracket being 56 or older
Data Cleaning
From the data overview, it's clear that this project will involve a lot of data cleaning. Let's read in the data again to properly represent the null values:
Dropping columns
We'll need to be relatively merciless about dropping columns. We'll drop the following:
- Columns that are inconsistent between the two datasets.
- Columns that don't indicate an employee resigned because they were dissatisfied.
For DETE, we'll drop the object
columns from Professional Development [28]
to Health & Safety [48]
.
As a reminder, these were the columns that captured Strongly Agree-Strongly Disagree information.
For TAFE, we'll drop the object
columns from Main Factor [17]
to Workplace Topic [65]
.
As a reminder, these were the columns that captured Strongly Agree-Strongly Disagree and some Yes-No information.
Cleaning column names
As you probably noticed, each dataset contains many of the same column information though the column names our different. Here are some of the columns we'd like to keep for our final analysis:
DETE Column Name | TAFE Column Name | Description |
---|---|---|
ID | Record ID | An id used to identify the participant of the survey |
SeperationType | Reason for ceasing employment | The reason why the participant's employment ended |
Cease Date | CESSATION YEAR | The year or month the participant's employment ended |
DETE Start Date | The year the participant began employment with the DETE | |
Employment Status | Employment Type. Employment Type | The employment status of the employee |
Position | Classification. Classification | The position of the employee |
LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years) | |
LengthofServiceCurrent. Length of Service at current workplace (in years) | The length of the person's employment in the role (in years) | |
Age | CurrentAge. Current Age | The age of the participant |
Gender | Gender. What is your Gender? | The gender of the participant |
Standardize column names
Filter rows
Recall, we're only interested in analysing employees that have resigned.
We'll use the separationtype
column to filter out all rows without a seperation type of Resignation or similar.
Converting date types
In our DETE data overview, we observed that cease_date
, dete_start_date
and role_start_date
were represented as object
instead of date
types. In the TAFE data overview, cease_date
was represented as a float
.
Let's extract the year from these columns and convert them to date
types.
Identifying dissatisfied employees
Now, we'll identify any employees who resigned because they were dissatisfied. Below are the columns we'll use to make this assessment:
TAFE
Contributing Factors. Dissatisfaction
Contributing Factors. Job Dissatisfaction
DETE
job_dissatisfaction
dissatisfaction_with_the_department
employment_conditions
work_life_balance
workload
If an employee indicated that any of the factors above caused them to resign, we'll mark them as dissatisfied
in a new column.
First, let's update Contributing Factors. Dissatisfaction
and Contributing Factors. Job Dissatisfaction
in the TAFE dataset to True
, False
or NaN
values.
We won't need to do this for the DETE dataset as they are already in this format.
Create an institute service column
The TAFE dataset already has an institute_service
column which tells us the length of time an employee spent in a workplace.
We can compute a similar column for the DETE dataset by subtracting dete_start_date
from cease_date
as follows:
Combining the data
With all of that cleaning done, we can finally combine the two seperate datasets into one
Combining these datasets also meant combining columns which had no overlap. This would lead to a lot of null
values.
So we'll drop any columns with less than 500 non-null
values. The table above highlights each columns non-null
value.
Categorize institute service
The final step in our data cleaning is to normalize the institute_service
column. Currently, it holds values that are either of the format 1.0
or 1-2
.
We'll instead convert these years into the following categories:
New
: Less than 3 years at a companyExperienced
: 3-6 years at a companyEstablished
: 7-10 years at a companyVeteran
: 11 or more years at a company
Data Analysis
How does length of service affect resignation?
- ~43% of resigned Established employees were dissatisfied.
- ~30% of resigned Experienced employees were dissatisfied.
- ~28% of resigned New employees were dissatisfied.
- ~43% of resigned Veteran employees were dissatisfied.
The trend is that employees who spend longer at a company are more likely to resign due to dissatisfaction.
How does employment status affect resignation?
From this, we can observe that:
- A slightly higher proportion of Casual Employees resign due to dissatisfaction that Contract/Casual employees.
- A significantly higher proportion of Permanent Part-Time Employees resign due to dissatisfaction that Temporary Part-Time employees.
- A higher proportion of Permanent Full-Time Employees resign due to dissatisfaction that Temporary Full-Time employees.
This highlights that permanent employees are more likely to resign due to dissatisfaction.
How does gender affect resignation?
This highlights that within the Male
category, a slightly higher proportion resign due to satisfaction.
Results
From our analysis, we've learned that the following groups are more likely to resign due to dissatisfaction:
- Employees who spend longer at a company.
- Permanent employees, in contrast to Temporary or Contract employees.
- Male employees, though there is a marginal difference compared to Female employees.