In this project, we'll be cleaning and analyzing exit surveys to learn how different factors affect employee resignations.
Specifically, we'll explore how the following affect the proportion of resignations due to dissatisfaction:
The following groups are more likely to resign due to dissatisfaction:
For more details, refer to the full analysis below.
We'll be working 2014 exit surveys from the Department of Education, Training and Employment (DETE), and 2013 exit surveys from the Technical and Further Education (TAFE) Institute. Because a data dictionary wasn't provided, we'll use our general knowledge to define the columns.
dete_survey.csv
has a total of 56 columns and 822 rows. Below is a preview of some of the more important columns:
ID
: An id used to identify the participant of the surveySeparationType
: The reason why the person's employment endedCease Date
: The year or month the person's employment endedDETE Start Date
: The year the person began employment with the DETEtafe_survey.csv
has a total of 72 columns and 702 rows. Below is a preview of some of the more important columns:
Record ID
: An id used to identify the participant of the surveyReason for ceasing employment
: The reason why the person's employment endedLengthofServiceOverall. Overall Length of Service at Institute (in years)
: The length of the person's employment in yearsFrom the table above, we can make note of the following:
Business Unit
, Aboriginal
, Torres Strait
, South Sea
, Disability
and NESB
are missing at least ~85% of data.Classification
is missing roughly ~40% of data.bool
types, while most of the remainder are object
types. ID
is the only int
.Cease Date
, DETE Start Date
and Role Start Date
are represented as object
instead of date
types.While most of the summary statistic rows don't apply due to the absence of int
types, we can make note of the following:
DETE Start Date
and Role Start Date
values are Not Stated. This seems to be an alias for null
.Professional Development
up to Health & Safety
have 6 unique values: A, SA, N, D, SD, M. Age
column seems to be broken into brackets, with 61 or older the most frequent value.Aboriginal, Torres Strait
, South Sea
, Disability
and NESB
columns only have 1 unique value and are otherwise mostly null
.null
, we can probably translate that to "No"From the table above, we can make note of the following:
Contributing Factors
columns are missing 40% of data.Record ID
and CESSATION YEAR
, are object
types.Gender
, CurrentAge
and EmploymentType
which are shared with the DETE dataset.We can observe the following:
Contributing Factors
values are "-".Institute Views
and WorkUnitViews
have 6 unique values: CurrentAge
is also split into age brackets, with the most frequency bracket being 56 or olderFrom the data overview, it's clear that this project will involve a lot of data cleaning. Let's read in the data again to properly represent the null values:
We'll need to be relatively merciless about dropping columns. We'll drop the following:
For DETE, we'll drop the object
columns from Professional Development [28]
to Health & Safety [48]
.
As a reminder, these were the columns that captured Strongly Agree-Strongly Disagree information.
For TAFE, we'll drop the object
columns from Main Factor [17]
to Workplace Topic [65]
.
As a reminder, these were the columns that captured Strongly Agree-Strongly Disagree and some Yes-No information.
As you probably noticed, each dataset contains many of the same column information though the column names our different. Here are some of the columns we'd like to keep for our final analysis:
DETE Column Name | TAFE Column Name | Description |
---|---|---|
ID | Record ID | An id used to identify the participant of the survey |
SeperationType | Reason for ceasing employment | The reason why the participant's employment ended |
Cease Date | CESSATION YEAR | The year or month the participant's employment ended |
DETE Start Date | The year the participant began employment with the DETE | |
Employment Status | Employment Type. Employment Type | The employment status of the employee |
Position | Classification. Classification | The position of the employee |
LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years) | |
LengthofServiceCurrent. Length of Service at current workplace (in years) | The length of the person's employment in the role (in years) | |
Age | CurrentAge. Current Age | The age of the participant |
Gender | Gender. What is your Gender? | The gender of the participant |
Recall, we're only interested in analysing employees that have resigned.
We'll use the separationtype
column to filter out all rows without a seperation type of Resignation or similar.
In our DETE data overview, we observed that cease_date
, dete_start_date
and role_start_date
were represented as object
instead of date
types. In the TAFE data overview, cease_date
was represented as a float
.
Let's extract the year from these columns and convert them to date
types.
Now, we'll identify any employees who resigned because they were dissatisfied. Below are the columns we'll use to make this assessment:
TAFE
Contributing Factors. Dissatisfaction
Contributing Factors. Job Dissatisfaction
DETE
job_dissatisfaction
dissatisfaction_with_the_department
employment_conditions
work_life_balance
workload
If an employee indicated that any of the factors above caused them to resign, we'll mark them as dissatisfied
in a new column.
First, let's update Contributing Factors. Dissatisfaction
and Contributing Factors. Job Dissatisfaction
in the TAFE dataset to True
, False
or NaN
values.
We won't need to do this for the DETE dataset as they are already in this format.
The TAFE dataset already has an institute_service
column which tells us the length of time an employee spent in a workplace.
We can compute a similar column for the DETE dataset by subtracting dete_start_date
from cease_date
as follows:
With all of that cleaning done, we can finally combine the two seperate datasets into one
Combining these datasets also meant combining columns which had no overlap. This would lead to a lot of null
values.
So we'll drop any columns with less than 500 non-null
values. The table above highlights each columns non-null
value.
The final step in our data cleaning is to normalize the institute_service
column. Currently, it holds values that are either of the format 1.0
or 1-2
.
We'll instead convert these years into the following categories:
New
: Less than 3 years at a companyExperienced
: 3-6 years at a companyEstablished
: 7-10 years at a companyVeteran
: 11 or more years at a companyThe trend is that employees who spend longer at a company are more likely to resign due to dissatisfaction.
From this, we can observe that:
This highlights that permanent employees are more likely to resign due to dissatisfaction.
This highlights that within the Male
category, a slightly higher proportion resign due to satisfaction.
From our analysis, we've learned that the following groups are more likely to resign due to dissatisfaction: