👩‍🦱Compensation at IBM👨‍🦱

By David Vasquez L. - May 5th, 2022

This is an exploratory exercise that uses an open source dataset from IBM. The first section, Preparing the Dataset, sets up the tools we will use for the analysis. Next, Visualization and Initial Analysis explores the composition and characteristics of the dataset. We then look for correlations among the variables. Finally, using linear regression, we explore the factors that shape monthly income at the company.

🌊Preparing the Dataset🌊

To analyze this dataset, let us first install and load the packages we will need.

install.packages("data.table")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("devtools")
install.packages("jtools")
install.packages("stargazer")
install.packages("ggplot2")
install.packages("GGally")
library(GGally)
library(dplyr)
library(ggplot2)
library(jtools)
library(stargazer)
options(scipen=999)

install.packages("rstatix")
install.packages("ggpubr")

Here, we import the dataset we will analyze.

# Read the IBM dataset
Hub1 <- read.csv("IBM.csv")

Let us start exploring its structure and a summary of all the variables it includes.

# Explore the dataset with head(), str() and summary()
head(Hub1)

str(Hub1)

We will now explore which variables are included in the dataset, to understand the scope of our analysis.

names(Hub1)

stargazer(Hub1, type = "text", digits = 2)

summary(Hub1)

Based on the initial analysis, the youngest worker in this dataset is 18 years old and the oldest is 60, with a mean age of 37. The mean distance from home is just over 9 miles, and the farthest any employee lives is 29 miles away, which can represent a challenge if hybrid work is reinstated.
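The figures quoted above can also be pulled out directly instead of being read off the full summary. The sketch below is only an illustration: `toy` is a made-up data frame that borrows the column names Age and DistanceFromHome, and its values are invented. On the real data, `toy` would be replaced with `Hub1`.

```r
# Toy data frame standing in for Hub1 (values are invented for illustration)
toy <- data.frame(
  Age = c(18, 25, 37, 44, 60),
  DistanceFromHome = c(1, 3, 9, 12, 29)
)

# Minimum, mean and maximum of each selected column
key_stats <- sapply(toy[, c("Age", "DistanceFromHome")],
                    function(x) c(min = min(x), mean = mean(x), max = max(x)))
print(key_stats)
```

The result is a small matrix with one column per variable, which is easier to scan than the full output of summary() when only a few variables matter.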

Below, we are exploring the 10 lowest monthly salaries at the company, ordered from lowest to highest.

head(sort(Hub1$MonthlyIncome), 10)

On the other hand, these are the 10 highest monthly salaries at the company, ordered from lowest to highest.

tail(sort(Hub1$MonthlyIncome), 10)

Initial conclusions

The company shows wide variation in the age of its employees, as well as in their salaries and in the distance between their homes and the office.

There are employees who have just begun working at the company and others who have reached 40 years of work. For some employees this is their first job, while others have previously worked at eight other companies.

There also seems to be data stemming from satisfaction or well-being surveys. The maximum score seems to be 4, a scale that avoids neutral responses and pushes employees toward either a positive or a negative rating.

🔬Visualization and Initial Analysis🔬

As I am interested in understanding the age distribution of the employees, I create a histogram with the current data. The mean age is 37 and, seemingly, most employees are between the ages of 30 and 45.

ggplot(data = Hub1, aes(x=Age)) + geom_histogram(bins = 30) + ggtitle("Age of Employees") + xlab("Age")

I also generate a histogram to understand the employees' monthly income. The distribution is positively skewed, which suggests that most employees earn a monthly income below $7,000. Indeed, the mean income is $6,502, with a standard deviation above $4,000.

ggplot(data = Hub1, aes(x=MonthlyIncome)) + geom_histogram(bins = 30) + ggtitle("Monthly Income of Employees") + xlab("Monthly Income")
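A quick numeric check for positive skew is to compare the mean with the median: in a right-skewed distribution the mean is pulled above the median. The sketch below uses simulated log-normal incomes, not the IBM data; the parameters are assumptions chosen only to produce a right-skewed shape.

```r
# Simulated right-skewed incomes (log-normal); NOT the IBM data
set.seed(42)
income <- rlnorm(1000, meanlog = log(5000), sdlog = 0.6)

mean_income   <- mean(income)
median_income <- median(income)

# When the mean sits above the median, at least half of the
# observations fall below the mean
share_below_mean <- mean(income < mean_income)

print(c(mean = mean_income, median = median_income,
        share_below_mean = share_below_mean))
```

On the real data, the same comparison would be run on Hub1$MonthlyIncome.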

The following scatterplot suggests that younger employees earn less than older ones, with the highest salaries concentrated for employees aged 40 to 60.

ggplot(data = Hub1, aes(x = Age, y = MonthlyIncome/1000)) + geom_point() + xlab("Age") + ylab("Income in thousands")

The following scatterplot shows that the observations with higher salaries are concentrated in the Research and Development department. The Sales department is second in concentration of high salaries. This is consistent with the type of business IBM runs.

ggplot(data = Hub1, aes(x = Department, y = MonthlyIncome/1000)) + geom_point() + xlab("Department") + ylab("Income in thousands")

Finally, there is far less attrition in positions with higher salaries.

ggplot(data = Hub1, aes(x = Attrition, y = MonthlyIncome / 1000)) + geom_point() + xlab("Attrition") + ylab("Monthly Income in Thousands")

Note:

It is important to understand the definition of the variables in order to develop a more precise analysis. For instance, it is unknown how Attrition has been defined by the company. Therefore, conclusions based on visualizations could be too adventurous.

🎏Search for Correlations🎏

Here, I select numerical variables to explore the correlations among them.

cor(Hub1[, c(1, 4, 5, 7, 8, 10, 11, 12, 13)])
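Selecting columns by position works, but it breaks silently if the column order changes. A possible alternative, sketched here on a made-up data frame (`toy`, with invented values and only three columns), is to select the numeric columns by type.

```r
# Invented example values; NOT the IBM data
toy <- data.frame(
  Age           = c(25, 37, 44, 52),
  Department    = c("Sales", "R&D", "R&D", "HR"),
  MonthlyIncome = c(3000, 6500, 9000, 12000),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, selected by type rather than by position
numeric_cols <- toy[, sapply(toy, is.numeric), drop = FALSE]
print(names(numeric_cols))

# Correlation matrix of the numeric columns
print(cor(numeric_cols))
```

Note that selecting by type would also pick up numeric identifier columns, if the dataset has any, so the result may still need manual pruning.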

To understand how statistically significant these correlations are, and to visualize them, I use the ggpairs() function from the GGally package and start by analyzing the variables whose correlations are statistically significant.

ggpairs(Hub1[, c(1, 5, 7, 8, 10, 11)])

According to these results, an employee's age is positively correlated with higher monthly income and with a higher educational level, and both correlations are highly statistically significant.
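To check the significance of a single pair of variables numerically, base R's cor.test() returns the Pearson correlation together with its p-value. The sketch below runs it on simulated data with a built-in positive relationship; the variable names and the numbers are assumptions for illustration, not values from the IBM dataset.

```r
# Simulated data with a built-in positive relationship; NOT the IBM data
set.seed(1)
age    <- round(runif(200, min = 18, max = 60))
income <- 1000 + 150 * age + rnorm(200, sd = 2000)

# cor.test() reports the Pearson correlation together with its p-value
result <- cor.test(age, income)
print(result$estimate)  # correlation coefficient
print(result$p.value)   # p-value for H0: the true correlation is 0
```

On the real data, the equivalent call would be cor.test(Hub1$Age, Hub1$MonthlyIncome).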

Then I develop the same analysis including categorical variables such as Attrition and Department.

ggpairs(Hub1[, c(1, 2, 3, 5, 7, 8, 10)])

I am interested in the relationship between Department and Monthly Income. The jitter plot below illustrates the difference in the number of observations per department. R&D seemingly has the largest population and, although it shows a concentration of employees with lower salaries, it also reveals a remarkable concentration of employees at the highest salaries in the sample, whereas HR has only a few outliers at the highest salary levels. This emphasizes the importance of R&D talent for the company.

ggplot(data = Hub1, aes(x = Department, y = MonthlyIncome/1000)) + geom_jitter() + xlab("Department") + ylab("Income in thousands")

The jitter plot below also reveals interesting facts about attrition, such as evidence that higher salaries are associated with less attrition at IBM. However, this conclusion should be drawn cautiously, as the number of R&D employees is far greater than the number of HR employees.

ggplot(data = Hub1, aes(x = Attrition, y = MonthlyIncome/1000)) + geom_jitter() + xlab("Attrition") + ylab("Income in thousands")
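Because the two attrition groups can differ greatly in size, it may help to put numbers next to the plot. The sketch below counts the observations in each group and compares their median income, using a tiny made-up data frame (`toy`) whose values are invented.

```r
# Invented example values; NOT the IBM data
toy <- data.frame(
  Attrition     = c("No", "No", "No", "No", "Yes", "Yes", "Yes"),
  MonthlyIncome = c(4000, 6500, 9000, 15000, 2500, 3000, 5200)
)

# Number of employees and median income in each attrition group
group_n      <- table(toy$Attrition)
group_median <- tapply(toy$MonthlyIncome, toy$Attrition, median)

print(group_n)
print(group_median)
```

The median is used here rather than the mean because income is right-skewed, so a few very high salaries would otherwise dominate the comparison.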

Considering the evidence shown in the charts above, I conclude that job satisfaction, however IBM has measured it, does not hinder attrition. As the jitter plot below shows, similar levels of attrition can be observed at different levels of job satisfaction.

ggplot(data = Hub1, aes(x = Attrition, y = JobSatisfaction)) + geom_jitter() + xlab("Attrition") + ylab("Job Satisfaction")
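A jitter plot of two discrete variables makes it hard to compare rates, since the groups have different sizes. One way to make the comparison explicit is a table of attrition shares within each satisfaction level. The sketch below uses made-up data constructed so that every level has the same attrition rate; it illustrates the technique only and says nothing about the actual IBM figures.

```r
# Invented example values; NOT the IBM data
toy <- data.frame(
  JobSatisfaction = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  Attrition       = c("Yes", "No", "No", "No",
                      "No", "Yes", "No", "No",
                      "No", "No", "Yes", "No",
                      "No", "No", "No", "Yes")
)

# Row-wise proportions: share of "Yes"/"No" within each satisfaction level
attrition_share <- prop.table(table(toy$JobSatisfaction, toy$Attrition),
                              margin = 1)
print(attrition_share)
```

If the "Yes" column is roughly constant across rows on the real data, that would support the reading that satisfaction level and attrition are unrelated.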

🧪Linear Regression Analysis🧪

At this point, I want to dig deeper into the correlations we found and apply regression analysis to the numerical variables in our dataset that may influence compensation at the company.

First, there is evidence of a positive association between age and monthly income, which is statistically significant (p-value less than 0.05). However, the R-squared, that is, the proportion of the variation in income that is predictable from age, reaches only 25%. We could increase predictability by accounting for additional factors.

model1 <- lm(MonthlyIncome ~ Age, data = Hub1)
summ(model1)
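In a regression with a single predictor, R-squared equals the square of the Pearson correlation between the two variables, so an R-squared of 25% corresponds to a correlation of about 0.5. The sketch below verifies this identity on simulated data; the coefficients and noise level are assumptions, not estimates from the IBM dataset.

```r
# Simulated data; NOT the IBM data
set.seed(7)
age    <- round(runif(300, min = 18, max = 60))
income <- 500 + 160 * age + rnorm(300, sd = 3500)

fit <- lm(income ~ age)

r_squared <- summary(fit)$r.squared   # share of variance explained by age
r         <- cor(age, income)         # Pearson correlation

# With a single predictor, R-squared equals the squared correlation
print(c(r_squared = r_squared, r_times_r = r^2))
```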

The association between monthly income and education is also positive and statistically significant. However, the proportion of the variation that is predictable from the independent variable reaches only 1%.

model2 <- lm(MonthlyIncome ~ Education, data = Hub1)
summ(model2)

When analyzing job satisfaction, it is interesting to see that, initially, it is not positively correlated with monthly income, although this result is not statistically significant. I will explore how these values change when accounting for additional factors.

model3 <- lm(JobSatisfaction ~ MonthlyIncome, data = Hub1)
summ(model3)

Now, I want to determine how strongly age is associated with Monthly Income when controlling for the number of years at the company, the level of education, and the number of companies previously worked for.

model4 <- lm(MonthlyIncome ~ Age + YearsAtCompany + Education + NumCompaniesWorked, data = Hub1)
summ(model4)

It is interesting to see, however, that when Age is left out of the equation, the role of education regains importance. Although the effect is not statistically significant, it would seem that education level could be a defining factor in the income of long-tenured employees at IBM. It should be noted, nevertheless, that the R-squared of this new equation is 9 percentage points lower than before, when Age was also included.

# Stored under a new name so the previous model is not overwritten
model5 <- lm(MonthlyIncome ~ YearsAtCompany + Education + NumCompaniesWorked, data = Hub1)
summ(model5)
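When comparing a model with and without Age, plain R-squared can never increase when a predictor is dropped, so adjusted R-squared and an F-test for nested models are the fairer yardsticks. The sketch below shows both on simulated data. The column names mirror those used above, but the data-generating process is an assumption made up for illustration.

```r
# Simulated data; NOT the IBM data. Column names mirror the ones used above.
set.seed(11)
n <- 400
Age            <- round(runif(n, 18, 60))
YearsAtCompany <- pmax(0, round((Age - 18) * runif(n, 0, 0.8)))
Education      <- sample(1:5, n, replace = TRUE)
MonthlyIncome  <- 1000 + 60 * Age + 300 * YearsAtCompany + rnorm(n, sd = 2500)
sim <- data.frame(Age, YearsAtCompany, Education, MonthlyIncome)

with_age    <- lm(MonthlyIncome ~ Age + YearsAtCompany + Education, data = sim)
without_age <- lm(MonthlyIncome ~ YearsAtCompany + Education, data = sim)

# Adjusted R-squared penalises extra predictors, so it is the fairer comparison
adj_r2 <- c(with_age    = summary(with_age)$adj.r.squared,
            without_age = summary(without_age)$adj.r.squared)
print(adj_r2)

# F-test for whether dropping Age significantly worsens the fit
print(anova(without_age, with_age))
```

On the real data, the same anova() call could be applied to the two fitted models, provided both are kept under different names.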

🏁Conclusions🏁

At IBM, compensation is determined, whether strategically or unintentionally, by several factors. Although age seems to play a role in our analysis, the number of years of tenure at the company and the number of companies worked for before joining IBM might be the most influential factors in how much an employee earns.

Future Research

An interesting topic to analyze with this dataset would be the factors behind attrition. As seen above, job satisfaction is presumably not a deterrent to attrition at IBM, although monthly income seems to have an influence on it. Because of current events in the labor market, with millions of workers quitting their jobs, attrition promises to be an interesting area of analysis.
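One possible starting point for such an analysis, offered only as a sketch, is a logistic regression with attrition as a binary outcome. The example below runs on simulated data in which, by assumption, higher income lowers the probability of leaving and job satisfaction has no effect. None of the numbers come from the IBM dataset, and the real Attrition column would first need to be recoded to 0/1.

```r
# Simulated data; NOT the IBM data. The effect sizes below are invented.
set.seed(3)
n <- 1000
MonthlyIncome   <- rlnorm(n, meanlog = log(5000), sdlog = 0.6)
JobSatisfaction <- sample(1:4, n, replace = TRUE)

# Assumed data-generating process: leaving is less likely at higher income,
# and unrelated to job satisfaction
p_leave   <- plogis(0.3 - 0.0003 * MonthlyIncome)
Attrition <- rbinom(n, size = 1, prob = p_leave)
sim <- data.frame(Attrition, MonthlyIncome, JobSatisfaction)

# Logistic regression of attrition (1 = left) on income and satisfaction
fit <- glm(Attrition ~ MonthlyIncome + JobSatisfaction,
           data = sim, family = binomial)
print(summary(fit)$coefficients)
```

A negative coefficient on MonthlyIncome would indicate that the odds of leaving fall as income rises, holding job satisfaction constant.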
