Employee churn prediction

Analyze employee churn. Find out why employees are leaving the company, and learn to predict who will leave the company.

In the past, most of the focus was on ‘rates’ such as the attrition rate and the retention rate. HR managers computed past rates and tried to predict future rates using data warehousing tools. These rates present the aggregate impact of churn, but this is only half the picture. Another approach is to focus on individual records in addition to the aggregate.

In customer churn, you predict which customers will stop buying and when. Employee churn is similar, but it focuses on the employee rather than the customer: here, you predict which employees will leave and when. Employee churn is expensive, so even incremental improvements can give significant results. Such predictions help in designing better retention plans and improving employee satisfaction.

What is Employee Churn?

Employee churn can be defined as a leak or departure of an intellectual asset from a company or organization. In simple words, when employees leave the organization, it is known as churn. More generally, churn occurs when a member of a population leaves that population.

Research has found that employee churn is affected by age, tenure, pay, job satisfaction, working conditions, growth potential, and employee perceptions of fairness. Other variables such as gender, ethnicity, education, and marital status were also found to be important factors in predicting employee churn. Some employees, such as those with niche skills, are harder to replace. Their departure affects ongoing work and the productivity of existing employees. Acquiring new employees as replacements has its own costs, such as hiring and training costs. The new employee will also take time to reach the level of technical or business expertise of the employee who left. Organizations tackle this problem by applying machine learning techniques to predict employee churn, which helps them take the necessary actions.

The following points help you understand how employee churn and customer churn compare:

A business chooses which employees to hire, whereas in marketing you don’t get to choose your customers.

Employees will be the face of your company, and collectively, the employees produce everything your company does.

Losing a customer affects revenues and brand image. Acquiring new customers is difficult and costly compared to retaining existing customers. Employee churn is also painful for companies and organizations. It requires time and effort in finding and training a replacement.

Employee churn has unique dynamics compared to customer churn. Understanding these dynamics helps in designing better employee retention plans and improving employee satisfaction. Data science algorithms can be used to predict future churn.

Importing modules

import pandas as pd
import matplotlib.pyplot as plt  # for plotting graphs
import seaborn as sns  # for plotting graphs

plt.style.use('ggplot')

Importing dataset

data = pd.read_csv('/work/HR_comma_sep.csv')
data.head()

data.info()

About Dataset

This dataset has 14,999 employee records and 10 attributes (6 integer, 2 float, and 2 object). No column has null or missing values.
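
The type and missing-value summary above can also be checked programmatically rather than read off data.info(). The following is a minimal sketch on a hypothetical four-row stand-in for the dataset (the values are made up for illustration); on the real DataFrame the same two calls would be applied to data.

```python
import pandas as pd

# Hypothetical miniature stand-in for HR_comma_sep.csv (values are made up).
sample = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "number_project": [2, 5, 7, 5],
    "salary": ["low", "medium", "medium", "low"],
    "left": [1, 1, 1, 0],
})

# Columns holding numeric data (integers and floats)
numeric_cols = sample.select_dtypes(include="number").columns

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

print(list(numeric_cols))
print(missing_per_column)
```

If every entry of missing_per_column is zero, no imputation step is needed before modeling.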

Description of each column:

satisfaction_level: The employee's satisfaction score, which ranges from 0 to 1.

last_evaluation: The employee's performance as evaluated by the employer, which also ranges from 0 to 1.

number_project: The number of projects assigned to the employee.

average_montly_hours: The average number of hours the employee worked per month (the column name is spelled this way in the dataset).

time_spend_company: The employee's experience, i.e., the number of years spent in the company.

Work_accident: Whether the employee has had a work accident or not.

promotion_last_5years: Whether the employee has had a promotion in the last 5 years or not.

Departments: The employee's working department/division (the column name in the dataset has a trailing space: 'Departments ').

salary: The salary level of the employee: low, medium, or high.

left: Whether the employee has left the company or not.

Data Insights

In the given dataset, there are two types of employees: those who stayed and those who left the company. You can therefore divide the data into two groups and compare their characteristics. Here, you can find the average of each group using the groupby() and mean() functions.

data.groupby('left').mean(numeric_only=True)  # numeric_only skips the text columns (department, salary)

Employees who left the company had a lower satisfaction level, a lower promotion rate, lower salaries, and worked more hours than those who stayed.
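
Because salary is stored as text, it does not appear in a table of numeric group means. One way to compare salary levels between the two groups is a cross-tabulation. The sketch below uses a small hypothetical sample (made-up values) and normalizes each row so that it shows the share of stayers (0) and leavers (1) within each salary level.

```python
import pandas as pd

# Hypothetical sample; the real dataset has the same two column names.
sample = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1, 1, 0, 0, 1, 0, 0, 0],
})

# Each row sums to 1: share of employees who stayed (0) or left (1) per salary level
churn_by_salary = pd.crosstab(sample["salary"], sample["left"], normalize="index")
print(churn_by_salary)
```

On the real data, the same call with data['salary'] and data['left'] would show whether the low-salary group really has the highest share of leavers.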

Data Visualization

Employees Left

Let's check how many employees left.

left_value_count = data['left'].value_counts()
left_value_count = left_value_count.rename(index={0: 'Still working', 1: 'Left the company'})

plt.figure(figsize=(7, 3))
left_value_count.plot(kind='bar')
plt.ylabel('Number of employees')
plt.xticks(rotation=0)

left_value_count

Here, you can see that out of 14,999 employees, 3,571 left and 11,428 stayed. The employees who left make up about 24% of the total.
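
The percentage can be worked out from the two counts quoted above. This short sketch only redoes that arithmetic; with the DataFrame loaded, data['left'].value_counts(normalize=True) should give the same shares directly.

```python
import pandas as pd

# Counts taken from the value_counts() output described in the text
counts = pd.Series({"Still working": 11428, "Left the company": 3571})

# Share of each group in the whole workforce
share = counts / counts.sum()
print(share.round(3))
```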

Number of projects

plt.figure(figsize=(7, 5))
sns.countplot(x=data['number_project'], color='gray')
plt.xlabel('Number of Projects')
plt.ylabel('Number of Employees')

Most employees work on 3 to 5 projects.

Time spent in company

data['time_spend_company'].count()  # number of non-null entries

plt.figure(figsize=(7, 5))
sns.countplot(data=data, x='time_spend_company', color='gray')

Most employees have 2 to 4 years of experience. There is also a large gap between the number of employees with 3 years and with 4 years of experience.

Subplots using seaborn (univariate analysis)

features = ['number_project', 'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years', 'Departments ', 'salary']
plt.figure(figsize=(10, 15))
for i, feature in enumerate(features):
    # one count plot per feature on a 4x2 grid
    plt.subplot(4, 2, i + 1)
    sns.countplot(x=feature, data=data, color='gray')
    plt.xticks(rotation=90)
    plt.title("No. of employee")
plt.subplots_adjust(hspace=1.0)

You can observe the following points in the above visualization:

Most employees work on 3 to 5 projects.

There is a large gap between the number of employees with 3 years and with 4 years of experience.

The employees who left the company make up about 24% of all employees.

Very few employees were promoted in the last 5 years.

The sales department has the most employees, followed by technical and support.

Most employees receive a medium or low salary.

Comparing all the features against the target variable ("left")

plt.figure(figsize=(10, 15))
for i, feature in enumerate(features):
    # one count plot per feature, split by whether the employee left
    plt.subplot(4, 2, i + 1)
    sns.countplot(data=data, x=feature, hue='left')
    plt.xticks(rotation=90)
    plt.title("No. of employee")
plt.subplots_adjust(hspace=1.0)

sns.heatmap(data.corr(numeric_only=True), annot=True)  # correlate numeric columns only

You can observe the following points in the above visualization:

Employees with more than 5 projects (that is, 6 or 7) tended to leave the company. It seems they may have been overloaded with work.

Employees with five years of experience leave more often than others, possibly because they received no promotion in the last 5 years, while those with more than 6 years of experience tend not to leave, perhaps because of their attachment to the company.

Employees who were promoted in the last 5 years mostly did not leave; in other words, nearly all of those who left had not been promoted in the previous 5 years.
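
The count plots show absolute numbers, which can hide how likely each group is to leave. Since left is coded 0/1, its mean within a group is that group's churn rate. The sketch below shows the idea on a small hypothetical sample (made-up values); the column names match the dataset, everything else is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample with the dataset's column names
sample = pd.DataFrame({
    "number_project": [2, 2, 3, 3, 3, 4, 4, 7, 7],
    "left":           [1, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Mean of a 0/1 column is the share of 1s; size gives the group's headcount
churn_rate = sample.groupby("number_project")["left"].agg(["mean", "size"])
churn_rate = churn_rate.rename(columns={"mean": "churn_rate", "size": "employees"})
print(churn_rate)
```

Reporting the headcount next to the rate helps avoid over-reading a high rate that comes from only a handful of employees.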

Data Analysis and Visualization summary

The following features appear to have the most influence on whether a person leaves the company:

Promotions: Employees are far more likely to quit their job if they haven't received a promotion in the last 5 years.

Time with Company: The three-year mark looks like a crucial point in an employee's career; most of those who quit do so around that time. Another important point is the six-year mark, where an employee is very unlikely to leave.

Number of Projects: Employee engagement is another critical factor in whether an employee leaves the company. Employees with 3-5 projects are less likely to leave. Employees with fewer or more projects are more likely to leave.

Salary: Most of the employees who quit were in the medium or low salary groups.

Cluster Analysis

Let's find out the groups of employees who left. The most important factors in whether an employee stays or leaves appear to be satisfaction and performance in the company, so let's group the employees who left using cluster analysis on those two features.

from sklearn.cluster import KMeans

# filter data: only employees who have left the company
# (.copy() so a label column can be added later without a pandas warning)
left_emp = data.loc[data['left'] == 1, ['satisfaction_level', 'last_evaluation']].copy()

# apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0).fit(left_emp)

# add a new column "label" and assign the cluster labels
left_emp['label'] = kmeans.labels_
plt.scatter(left_emp['satisfaction_level'], left_emp['last_evaluation'], c=left_emp['label'], cmap='Accent')
plt.xlabel('Satisfaction level')
plt.ylabel('Last evaluation')
plt.title('3 clusters of employees who left')

Here, the employees who left the company can be grouped into 3 types:

High satisfaction and high evaluation (shaded green in the graph); you can call them 'Winners'.

Low satisfaction and high evaluation (shaded blue in the graph); you can call them 'Frustrated'.

Moderate satisfaction and moderate evaluation (shaded grey in the graph); you can call them 'Bad match'.
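
Besides reading the groups off the scatter plot, the cluster centers themselves can be inspected to see which combination of satisfaction and evaluation each cluster represents. The sketch below is an illustration on synthetic points scattered around three made-up centers (the centers, the noise level, and the *_demo names are all assumptions, not values from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical centers: (satisfaction_level, last_evaluation)
true_centers = np.array([[0.10, 0.90], [0.40, 0.50], [0.80, 0.90]])

# 50 synthetic points with small Gaussian noise around each center
points = np.vstack([c + rng.normal(scale=0.02, size=(50, 2)) for c in true_centers])
left_emp_demo = pd.DataFrame(points, columns=["satisfaction_level", "last_evaluation"])

kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(left_emp_demo)

# One row per cluster, sorted by satisfaction so the rows are easy to compare
centroids = pd.DataFrame(kmeans_demo.cluster_centers_, columns=left_emp_demo.columns)
centroids = centroids.sort_values("satisfaction_level").reset_index(drop=True)
print(centroids.round(2))
```

On the real data, the same centroid table built from kmeans.cluster_centers_ gives a numeric basis for descriptive names such as those used above.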

Prediction Model

Encoding categorical data

The salary column in the dataset takes the values low, medium, and high.

Many machine learning algorithms require numerical input data, so you need to represent categorical columns as numerical columns.

In order to encode this data, I am mapping each value to a number. For example, the salary column's values can be represented as low: 0, medium: 1, and high: 2.

This process is known as label encoding, and sklearn conveniently does it for you using LabelEncoder. Note that LabelEncoder assigns the integers in sorted (alphabetical) order of the labels, so for salary it produces high: 0, low: 1, medium: 2 rather than the ordered mapping in the example.
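
To make the difference between the two encodings concrete, here is a small sketch on a hypothetical four-value salary Series. The explicit dictionary (salary_order, a name introduced here for illustration) is one possible way to keep the low < medium < high order; it is an alternative, not what the code in this article does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical salary values
salary = pd.Series(["low", "medium", "high", "low"])

# LabelEncoder numbers the classes in sorted (alphabetical) order
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(salary)
print(list(le_demo.classes_))
print(list(encoded))

# Explicit mapping that preserves the natural order of the levels
salary_order = {"low": 0, "medium": 1, "high": 2}
ordinal = salary.map(salary_order)
print(list(ordinal))
```

Tree-based models such as gradient boosting can split on either encoding, so the choice matters more for models that treat the numbers as magnitudes.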

data.columns

from sklearn import preprocessing

# create the label encoder
le = preprocessing.LabelEncoder()

# convert string labels into numbers
data['salary'] = le.fit_transform(data['salary'])
data['Departments '] = le.fit_transform(data['Departments '])

Splitting Train and test data

To understand model performance, divide the dataset into a training set and a test set.

X = data[['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'promotion_last_5years', 'Departments ', 'salary']]
y = data['left']

from sklearn.model_selection import train_test_split

# split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
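
Because only about a quarter of the employees left, a random split can leave the two sets with slightly different shares of leavers. An optional variation, not used in this article, is to pass stratify to train_test_split so that both sets keep roughly the same class balance. The sketch below uses hypothetical data (100 rows, 24 leavers); the *_demo and X_tr/X_te names are introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature, 24 leavers out of 100 employees
X_demo = pd.DataFrame({"satisfaction_level": [i / 100 for i in range(100)]})
y_demo = pd.Series([1] * 24 + [0] * 76)

# stratify=y_demo keeps the share of leavers similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```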

Model Building (Employee churn prediction model)

Using Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

# create the gradient boosting classifier
gb = GradientBoostingClassifier()

# train the model using the training set
gb.fit(X_train, y_train)

# predict the target variable for the test set
y_pred = gb.predict(X_test)

Evaluating Model Performance

from sklearn import metrics

# Model accuracy
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model precision
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model recall
print("Recall:", metrics.recall_score(y_test, y_pred))

Conclusion

You got a classification rate (accuracy) of 97%, which is considered good.

Precision: Precision is about being precise, i.e., when the model predicts that an employee will leave, how often it is correct. In this case, when your Gradient Boosting model predicted that an employee was going to leave, that employee actually left 95% of the time.

Recall: Recall is about coverage, i.e., of the employees who actually left, how many the model finds. In this case, your Gradient Boosting model identified 92% of the employees in the test set who left.
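
Both metrics can be derived from the confusion matrix, which counts true/false positives and negatives. The sketch below uses ten hypothetical labels and predictions (not the model's actual output) to show how the two formulas relate to what precision_score and recall_score return.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = left, 0 = stayed)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: of the predicted leavers, how many really left
precision_manual = tp / (tp + fp)
# Recall: of the real leavers, how many were predicted
recall_manual = tp / (tp + fn)

print(precision_manual, precision_score(y_true_demo, y_pred_demo))
print(recall_manual, recall_score(y_true_demo, y_pred_demo))
```

Applying confusion_matrix to y_test and y_pred from the model above would show how many leavers were missed and how many stayers were flagged by mistake.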
