Hypothetical Scenario: Kaggle
"A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses which conduct by the company. Many people signup for their training. Company wants to know which of these candidates are really wants to work for the company after training or looking for a new employment because it helps to reduce the cost and time as well as the quality of training or planning the courses and categorization of candidates. Information related to demographics, education, experience are in hands from candidates signup and enrollment.
This dataset designed to understand the factors that lead a person to leave current job for HR researches too. By model(s) that uses the current credentials,demographics,experience data you will predict the probability of a candidate to look for a new job or will work for the company, as well as interpreting affected factors on employee decision."
Research Question:
Given information about an individual, can we predict whether or not they are currently looking for a new job?
Data
Note that our data is imbalanced. 33% of the observations have the target = 1.
The data we will be looking at includes 12 different features:
Numerical Columns:
- city_development_index: Float from 0-1
- training_hours: Number of hours trained
Categorical:
- city: 123 different cities
- gender: 4 values: 'Male', 'Female', 'Other', nan
Ordinal Categorical:
- enrolled_university: 'no_enrollment', 'Part time course', 'Full time course', nan
- education_level: 'Primary School', 'High School', 'Graduate', 'Masters', 'Phd', nan
- company_size: '<10', '10/49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+', nan
- relevent_experience: 'No relevent experience', 'Has relevent experience'
- major_discipline: 'STEM', 'Business Degree', 'Arts', 'Humanities', 'No Major', 'Other', nan
- company_type: 'Early Stage Startup', 'Funded Startup', 'NGO', 'Other', 'Public Sector', 'Pvt Ltd', nan
- last_new_job: 'never', 1, 2, 3, 4, '>4', nan
Numerical/Categorical:
- experience: '<1', 1 through 20, '>20', nan
Target:
- target: 1 = "Looking for a Job", 0 = "Not Looking for a Job"
Feature Engineering and Preprocessing Pipeline
For this model, we divide our features into two main categories, numerical and categorical, so that the preprocessing pipeline can deal with missing values in a way appropriate to each type.
- Categorical Preprocessing:
  - The first step in this pipeline is a SimpleImputer that fills missing values (np.nan) with the constant "missing". There are many other imputation strategies, but there could be underlying reasons in the data collection for why an observation is missing data, so filling in the most frequent value would add bias from us, the researchers. Without knowing more about why these values are np.nan, we simply fill them with "missing" for categorical features.
  - We then pipe this into a OneHotEncoder to encode each variable's values as separate binary columns.
  - Note that after further testing, I decided to OneHotEncode the ordinal features as well; mapping the ordinal features individually to their relative values had no noticeable impact on the model.
  - Note that after some EDA, I noticed that some of the categorical features could be encoded with an OrdinalEncoder. However, that would require a custom class to label my features so that the OrdinalEncoder encodes them the way I envision (the default order is alphabetical).
- Numerical Preprocessing:
  - Instead of a SimpleImputer, I chose to use an IterativeImputer, which mimics R's MICE package (Multivariate Imputation by Chained Equations). I felt this was a better fit than a SimpleImputer because:
    - We can't fill in the values with "missing", since that would break the pipeline when standardizing the columns.
    - It is a step toward dealing with the increased noise introduced by imputation.
  - Next, we use a StandardScaler to standardize our data. Since neither of my numerical columns has severe outliers, this is preferred over a RobustScaler; standardizing also brings both features into a similar range.
  - Finally, because many ML algorithms perform better when the numerical features have a Gaussian distribution, we finish with a QuantileTransformer (a sketch of both pipelines follows this list).
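Below is a minimal sketch of how these two pipelines might be wired together. The column assignments (in particular, treating 'experience' as categorical), the ColumnTransformer wrapper, and options such as handle_unknown='ignore' are my assumptions rather than the exact original code:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, QuantileTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Column lists follow the data description above; treating 'experience' as categorical is an assumption.
categorical_cols = ['city', 'gender', 'enrolled_university', 'education_level', 'company_size',
                    'relevent_experience', 'major_discipline', 'last_new_job', 'company_type',
                    'experience']
numerical_cols = ['city_development_index', 'training_hours']

categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),  # np.nan -> "missing"
    ('onehot', OneHotEncoder(handle_unknown='ignore')),                    # one binary column per category
])
numerical_pipe = Pipeline([
    ('impute', IterativeImputer()),                                   # MICE-style chained imputation
    ('scale', StandardScaler()),                                      # zero mean, unit variance
    ('gaussian', QuantileTransformer(output_distribution='normal')),  # push features toward a Gaussian shape
])
preprocessor = ColumnTransformer([
    ('cat', categorical_pipe, categorical_cols),
    ('num', numerical_pipe, numerical_cols),
])
```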
Algorithms & Search
For this section, I chose a few algorithms to include in my RandomizedSearchCV:
RandomForestClassifier
- Why: This was an ML model I learned in my Intro to ML class. It seemed like a strong contender for this problem because it aggregates the predictions of multiple decision trees (which individually tend to overfit the training data) in order to decrease the variance of the model.
- Hyperparameter Tuning (a sketch of the search follows this list):
  - Min Samples Leaf: np.linspace(1, 30, 4)
    - Minimum samples per leaf is a good hyperparameter for helping each decision tree make more generalizable predictions. Note that the default is 1, which can lead to decision trees overfitting the training data.
  - Bootstrap: [True, False]
    - When bootstrap is True, each decision tree is shown only a sample of the training data. This is an attempt to create "dumber" decision trees that together form a better-generalizing model. Note that this parameter interacts with max_samples, which we set to a small value (between 5 and 20) and which only applies when bootstrapping.
  - Class Weight: [None, 'balanced', 'balanced_subsample']
    - Since our data is imbalanced, we want to try a variety of class weights. Note that our SMOTE preprocessing step already helps correct for this imbalance, so I expect the ideal class_weight to be None; this is more of a sanity check for me.
  - Number of Estimators: np.linspace(50, 500, 4)
    - These values represent the number of decision trees that make up our random forest. Larger values can further decrease the variance of the model.
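As a rough illustration, the RandomForest distributions above might be passed to a RandomizedSearchCV as sketched here. The search settings (n_iter, cv) and the preprocessed training matrix are assumptions, and the linspace values are cast to int because scikit-learn expects integer counts for these parameters:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

common = {
    'min_samples_leaf': np.linspace(1, 30, 4).astype(int),   # 1, 10, 20, 30
    'class_weight': [None, 'balanced', 'balanced_subsample'],
    'n_estimators': np.linspace(50, 500, 4).astype(int),     # 50, 200, 350, 500
}
# max_samples is only valid when bootstrap=True, so the two cases are sampled separately
param_distributions = [
    {**common, 'bootstrap': [True], 'max_samples': list(range(5, 21))},
    {**common, 'bootstrap': [False]},
]
rf_search = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1),
                               param_distributions,
                               n_iter=20,              # assumption
                               scoring='f1_weighted',  # the metric discussed in "Evaluation Metrics"
                               cv=5,                   # assumption
                               random_state=42,
                               n_jobs=-1)
# rf_search.fit(X_train_preprocessed, y_train)  # hypothetical, already-preprocessed training data
```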
LinearSVC
- Why: Since this is a binary problem (we are predicting 1s and 0s), an SVM can perform better than random forests (which are intrinsically suited to multi-class problems). In addition, SVM models can have an advantage on sparser data, which, after the OneHotEncoder, could prove useful. Finally, I decided to use a LinearSVC because it trains much more quickly than the traditional SVC.
  - As we can see in the image above, SVC is extremely slow.
- Hyperparameter Tuning:
  - Class Weight: ['balanced', None]
    - As with my RandomForestClassifier, I include both 'balanced' and None for the class weight of my SVC model.
  - C: np.linspace(0.001, 10, 10)
    - To test different regularization strengths, I set C to values between 0.001 and 10.
KNeighborsClassifier
- Why: I included KNN because it is a fairly different algorithm from the tree-based models, and it could be interesting to see how it performs.
- Hyperparameter Tuning (a sketch of the search space follows this list):
  - Number of Neighbors: np.linspace(3, 13, 3)
    - This controls how many neighbors we look at before classifying an observation.
  - Weights: ['uniform', 'distance']
    - Distance: closer neighbors have a higher influence on the classification than farther neighbors.
    - Uniform: close and far neighbors have the same weight.
  - p: [1, 2]
    - p = 1 -> Manhattan distance; p = 2 -> Euclidean distance.
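For completeness, a sketch of the KNN search space described above; the search settings are again my assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV

knn_params = {
    'n_neighbors': np.linspace(3, 13, 3).astype(int),  # 3, 8, 13
    'weights': ['uniform', 'distance'],
    'p': [1, 2],                                        # 1 = Manhattan, 2 = Euclidean
}
knn_search = RandomizedSearchCV(KNeighborsClassifier(n_jobs=-1), knn_params,
                                n_iter=10, scoring='f1_weighted', cv=5, n_jobs=-1)
```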
ExtraTreesClassifier
- Why: This has a much faster implementation than RandomForestClassifier, because ExtraTrees chooses its split-point thresholds randomly rather than searching for the best split the way RandomForest does.
- Hyperparameter Tuning:
  - The same hyperparameter choices as my RandomForestClassifier.
LogisticRegression
- Why: I included logistic regression in order to see how it performs vs. RandomForest. Logistic regression is, in general, much quicker to train and much easier to interpret than random forests.
- Hyperparameter Tuning:
  - Class Weight: ['balanced', None]
    - As with my RandomForestClassifier, I include both 'balanced' and None for the class weight of my LogisticRegression model.
  - Solver: ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
    - I included a variety of solvers, as the best choice tends to be data-dependent and it is good to try several. Since my problem is not a multiclass problem, I am able to include all of these solvers.
  - Penalty: ['l1', 'l2', 'elasticnet', 'none']
    - This represents the penalty used in the loss function. Note that not every solver supports every penalty.
Evaluation Metrics
I decided to use a weighted F1 score as my metric since my data is imbalanced. For this business it is equally important to correctly predict those looking for work (precision) and to avoid missing potential job-seeking individuals (recall). This priority comes from resource allocation: if a business wants to reach out to individuals who are currently looking for jobs, we want a model that allocates its recruiters efficiently by avoiding outreach to false positives while making sure we don't miss true positives.
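For reference, precision, recall, and the per-class F1 are defined as follows, and the weighted F1 averages the per-class F1 scores weighted by each class's share of the observations:
\begin{equation}
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{equation}
\begin{equation}
F_1^{\text{weighted}} = \sum_{k \in \{0, 1\}} \frac{n_k}{N} \, F_1^{(k)}, \quad \text{where } n_k \text{ is the number of observations with label } k \text{ and } N = n_0 + n_1
\end{equation}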
RandomizedSearchCV Results:
Best Model:
```python
RandomForestClassifier(bootstrap=False,          # No bootstrapping -> each tree sees all the data
                       class_weight='balanced',  # Good because our data is imbalanced
                       min_samples_leaf=4,       # Increasing this can improve generalizability if we are overfitting
                       n_estimators=216,         # Plenty of trees to train
                       n_jobs=-1)
```
F1_Weighted Score: 0.798
Commentary:
- Since we are not bootstrapping, class_weight='balanced_subsample' is equivalent to 'balanced', and we can remove max_samples from the hyperparameter list.
- We notice that bootstrapping led to a worse cross-validated weighted F1 score.
- In addition, our min_samples_leaf was 4, which weakens the individual decision trees but increases the model's generalizability (decreases variance).
- class_weight='balanced' makes sense because our data is imbalanced and the model needs to weight the labels differently. This can lead to a higher recall score and a lower precision score.
Ensemble Learning
In order to further improve our model, we can look into ensemble learning as a way to improve our metric (f1_weighted). We will primarily look at:
1. VotingClassifier() - The simplest of the three: it combines multiple machine learning models and takes either the most common prediction ("hard voting") or the probability-weighted average of the individual learners' predictions ("soft voting").
2. Bagging - Bagging is useful because it tends to reduce the training time of each individual model and improve overall generality. It can also be used with a variety of different models.
3. Boosting - Boosting has shown a lot of promise in improving model metrics the most (at least from what I've read online), so I predict it may lead to my best evaluation metric.
Voting Classifier:
In this section, I simply combine the 3 models that seemed to perform best (across multiple iterations of my RandomizedSearchCV); a minimal sketch follows.
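A sketch of this combination, assuming the three tuned estimators whose hyperparameters are reported later in "Specifications of the Final Model"; the soft-voting choice is discussed in the results below:

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

estimators = [
    ('rf', RandomForestClassifier(bootstrap=False, class_weight='balanced',
                                  min_samples_leaf=4, n_estimators=216, n_jobs=-1)),
    ('lr', LogisticRegression(class_weight='balanced', solver='sag', n_jobs=-1)),
    ('et', ExtraTreesClassifier(bootstrap=True,  # bootstrap so max_samples applies
                                class_weight='balanced', max_samples=20,
                                min_samples_leaf=10, n_estimators=50, n_jobs=-1)),
]
voting_clf = VotingClassifier(estimators=estimators, voting='soft', n_jobs=-1)
# voting_clf.fit(X_train_preprocessed, y_train)  # hypothetical, already-preprocessed training data
```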
VotingClassifier Results:
- As we can see, our weighted F1 score is pretty close to how our single RandomForestClassifier performed.
- There does not appear to be a significant difference, so we can move on to other ensemble techniques.
Bagging:
In this section, I experiment with bagging my RandomForestClassifier; a sketch is shown below.
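A sketch of this experiment, reusing the tuned RandomForestClassifier; the number of bagged copies is an assumption:

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

base_rf = RandomForestClassifier(bootstrap=False, class_weight='balanced',
                                 min_samples_leaf=4, n_estimators=216, n_jobs=-1)
# Each bagged copy of the forest is fit on a bootstrap sample of the training data.
# (The keyword is `base_estimator` in scikit-learn versions before 1.2.)
bagged_rf = BaggingClassifier(estimator=base_rf, n_estimators=10, n_jobs=-1)
```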
Bagging Results:
- As with the VotingClassifier, bagging did not substantially improve the model's performance, nor did it reduce the model's predictive variance. In other words, bagging did not lead to more consistent results (i.e., a smaller standard deviation across folds).
Boosting
In this section, we will look at implementing a gradient boosting machine, specifically using a GradientBoostingClassifier as the final estimator in a StackingClassifier.
GradientBoostingClassifier
Ensemble Model Results:
VotingClassifier():
- Using RandomForestClassifier, LogisticRegression, and ExtraTreesClassifier, this strategy yielded similar results to my initial RandomForestClassifier.
- Note that I used 'soft' voting, since every one of my models outputs probabilities that serve as a measure of confidence in a prediction.
- I decided to not use this method for my final model because it did not perform significantly better than my initial model.
Bagging:
- Similarly to VotingClassifier, this ensemble method did not do significantly better than my simpler model (1 RandomForestClassifier).
GradientBoostingClassifier():
- This strategy, using a GradientBoostingClassifier as the final_estimator in a StackingClassifier, gave me my best f1_weighted score, so I will use it as my final model. Note that the 3 estimators from my VotingClassifier() were used as the first layer of my StackingClassifier.
Final Model Selection
I decided to use a StackingClassifier with the 3 estimators from my VotingClassifier and a GradientBoostingClassifier as my final estimator. This decision weighed the longer training time against the increase in my weighted f1_score and concluded it was worth it.
Specifications of the Final Model:
Our final model uses a GradientBoostingClassifier as the final estimator in a StackingClassifier. A StackingClassifier takes a set of base estimators whose outputs become the input of the final estimator, our GradientBoostingClassifier (which by default trains 100 shallow decision trees). Now, for the inputs:
1. Estimator 1:
- This model was my initial model that outperformed the other models I tested in my RandomizedSearchCV
- Note that the hyperparameters for the RandomForestClassifier were chosen through a RandomizedSearchCV explained in my Initial Model selection.
```python
RandomForestClassifier(bootstrap=False,          # Don't bootstrap (each tree is trained on all of the data)
                       class_weight='balanced',  # Imbalanced data
                       min_samples_leaf=4,       # Increasing this can improve generalizability if we are overfitting
                       n_estimators=216,         # Plenty of trees to reduce variability in predictions
                       n_jobs=-1)
```
2. Estimator 2:
- A LogisticRegression model was added to bring more variety to the set of models. These hyperparameters were chosen by rerunning my RandomizedSearchCV until I got models different from the RandomForestClassifier.
```python
LogisticRegression(class_weight='balanced',  # Imbalanced data
                   n_jobs=-1,
                   solver='sag')             # sag gives faster convergence on normalized data
```
3. Estimator 3:
- An ExtraTreesClassifier was also used because I wanted to include at least one model that bootstraps the data. In particular, this model trains 50 trees, each on a sample of 20 observations from the data, and the default max_features hyperparameter trains each tree on a subset of the features. Setting min_samples_leaf to 10 can lead to more generalizable results, as each tree is cut a bit shorter so that it predicts better on observations it hasn't seen before. Note that these hyperparameters were also chosen by rerunning my RandomizedSearchCV until I got models different from the RandomForestClassifier.
```python
ExtraTreesClassifier(bootstrap=True,           # Bootstrap samples for each tree (required when setting max_samples)
                     class_weight='balanced',  # Imbalanced data
                     max_samples=20,           # Train each tree on a sample of 20 observations
                     min_samples_leaf=10,      # Increasing this can improve generalizability if we are overfitting
                     n_estimators=50,          # Number of trees to create in our classifier
                     n_jobs=-1)
```
All together:
```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier

final_estimator = GradientBoostingClassifier()   # Default hyperparameters
reg = StackingClassifier(estimators=estimators,  # The individual estimators explained above
                         final_estimator=final_estimator,
                         n_jobs=-1)
```
But wait, there's more! In order to fit our StackingClassifier on the data, we needed to first include our preprocessing steps.
This involved:
1. Ranking some ordinal features by their relative probabilities of having the target variable = 1 (a sketch of this step follows the list).
   - Note that this is something I wanted to test to see how it would impact my final result; it is not common practice. Normally, when encoding ordinal features, we define the order heuristically. For example, say we have a feature, Education, with 3 categories: High School, Masters, PhD. The traditional way would be to either OneHotEncode them or encode them by their order of difficulty to acquire (High School -> 1, Masters -> 2, PhD -> 3). My method instead looks at each category and measures which has the highest probability of having target = 1 (the individual is currently looking for work). We then rank the categories in that order, where a higher number signifies that the category is more associated with looking for work.
2. Using OneHotEncoding on both 'city' and 'gender' after imputing "missing" anywhere the data was not collected.
3. Imputing the median for the numerical values, scaling them, and then transforming their distribution toward a Gaussian one.
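A minimal sketch of the probability-based ranking in step 1, assuming a pandas DataFrame df holding the raw features and the binary target column (column names follow the dataset description):

```python
import pandas as pd

# df: the training DataFrame loaded elsewhere (assumption)
ordinal_cols = ['enrolled_university', 'education_level', 'company_size', 'relevent_experience',
                'major_discipline', 'last_new_job', 'company_type', 'experience']

rank_maps = {}
for col in ordinal_cols:
    # P(target = 1 | col = value) for every category, including "missing"
    probs = (df[col].fillna('missing')
                    .to_frame()
                    .assign(target=df['target'])
                    .groupby(col)['target']
                    .mean()
                    .sort_values())
    # Higher rank -> category more associated with "Looking for a Job"
    rank_maps[col] = {val: rank for rank, val in enumerate(probs.index)}
```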
Evaluating the Final Model
Now that we have a final model, we can:
1. Predict on the test set and see what our f1_weighted score is.
2. Look at the confusion matrix of our predictions.
3. Look at feature importance, since we are using a tree-based algorithm.
First, we need to load in our testing data.
1. Train our model on all of the training data and then see how it performs on the test data
Result: As we can see, our model has a weighted F1 score of 0.795 on our testing set, which is around the same as on the training set. This is a good sign because it implies that our model probably was not overfit to the training data and that it generalizes to new data about as well as to the data it was trained on.
2. View the confusion matrix of our predictions and calculate the accuracy, recall and precision scores.
Interpretation:
- Looking at the confusion matrix, we see that our model did fairly well. In particular, our model misclassified only 14% of individuals not looking for work, and 37% of individuals looking for work.
- In particular, our precision score was 62%, which means that "when my model predicted someone to be looking for work, it was correct 62% of the time." Although this isn't an amazing score, it can still be useful for a variety of business cases. In the scenario where we use this model to help allocate our recruiters' time, a 62% chance that each candidate a recruiter contacts is actually looking for a job can be a huge time saver, compared to roughly a 33% chance when reaching out to someone at random without the model.
- In addition, we can look at the recall score. At 63%, this means that "when an individual was looking for work, our model correctly classified them 63% of the time." This is also an important metric for the business situation above; in particular, we want to minimize false negatives, i.e., cases where our model misses someone who is actually looking for work.
- Altogether, we can look at the weighted F1 score of 79%, because our business case revolves around balancing precision and recall. Note that since this is an imbalanced dataset, the weighted F1 computes the score for each label, weights it by the label's relative frequency, and outputs a value that, in this case, does not fall between the precision and recall scores of the positive class.
3. Feature Importance.
For interpretability, it is always important to try to extract feature importance from a model - in other words, which features were deemed the "most important" by our model. For this task, I use sklearn's built-in permutation_importance over 100 iterations (sketched below).
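A sketch of this step, assuming the fitted final model is final_model and the held-out data is X_test (a DataFrame) and y_test:

```python
from sklearn.inspection import permutation_importance

# final_model, X_test, y_test are assumed to be defined earlier in the notebook
result = permutation_importance(final_model, X_test, y_test,
                                n_repeats=100, scoring='f1_weighted',
                                random_state=42, n_jobs=-1)
for name, mean_imp in sorted(zip(X_test.columns, result.importances_mean),
                             key=lambda pair: pair[1], reverse=True):
    print(f'{name}: {mean_imp:.4f}')
```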
Results:
- Interestingly enough, it looks like the city_development_index is the most impactful when making this prediction. In order to better understand the relationship between city_development_index and the target variable, it is helpful to look at a quick plot:
As we can see, as the city_development_index increases, people are less likely to be looking for a job. This makes sense, as a higher city development index could mean that people are more content with their current location. Note that this assumes that looking for work is a function of location as well as of things related to your current job (job type, company size, etc.). The graph can also be read from the flip side: people in cities with a low development index are much more likely to be seeking a new job.
Besides city_development_index, the city, company_size and major_discipline also had a relatively high impact.
Conclusion:
Summary:
With a goal of implementing a variety of skills learned in my Machine Learning Lab, this project focused on HR data to help identify whether or not someone is currently looking for a job. After extensive EDA, I decided to construct three separate pipelines for preprocessing: one for each type of feature (numerical, categorical, ordinal). Next, although my data was imbalanced, I decided against oversampling with SMOTE because it didn't lead to a noticeable improvement (as shown with cross validation). Finally, I tried several ensemble techniques and decided that my final model would be a StackingClassifier with a variety of estimators derived from my RandomizedSearchCV and a GradientBoostingClassifier as my final estimator. Note that I used a weighted F1 score as the metric for comparing models, since it values both precision and recall while taking into consideration that the data is imbalanced.
Common Questions:
- Why does any of this matter?
- I am glad you asked! Although this project was primarily a tool to explore different modeling methods on fairly clean data, we framed it around a hypothetical scenario in which this data would be used. This is helpful when working on actual business problems, because it is important to iterate over different models and compare them with a consistent metric that is relevant to the business use case.
- Why did I impute categorical variables to be "missing" instead of "most_frequent"?
- I chose to do this because imputing the most frequent value can add bias to our model; in essence, it assumes the entry was empty because of a clerical error, when the missingness could itself have an impact on our target variable. One could argue that imputing "missing" adds my own bias, assuming these weren't just clerical errors, and could have an equally negative impact on our model. In the end, it was personal preference.
- Follow up: Why did I not impute missing for numerical as well?
- With numerical data, we need slightly different imputing strategies. This is because we want our end result to be all numerical. Therefore, I chose to use an IterativeImputer which I explain below.
- How did I decide to use my own ordinal encoding?
  - After noticing that a lot of the categories had some inherent order to them, I was interested to see if I could find a relationship between a given value and an increased chance that an observation is "Looking for a Job" (target == 1). I did this by calculating, for each column (other than 'city', 'gender', and the numerical columns), the relative probability that each value leads to the target being 1: \begin{equation} P(\text{target} = 1 \mid X_{col} = val) \quad \text{for each ordinal column } col \text{ and each unique value } val \text{ of } X_{col} \end{equation}
- Why did I use an IterativeImputer for my numerical data?
- This decision was made primarily because I did not want a single rule of always imputing the median. An IterativeImputer instead works as follows (see the toy example below):
  - Say you have 4 columns ('a', 'b', 'c', 'd') and one column ('d') is missing some values. The iterative imputer trains a model that predicts the missing values in 'd' from the values in ('a', 'b', 'c').
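A toy illustration of that behaviour (the numbers here are made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns 'a', 'b', 'c', 'd'; one value in 'd' is missing and gets modeled from the other columns
X = np.array([[1.0, 2.0, 3.0, 6.0],
              [2.0, 4.0, 6.0, 12.0],
              [3.0, 6.0, 9.0, np.nan],
              [4.0, 8.0, 12.0, 24.0]])
print(IterativeImputer(random_state=0).fit_transform(X))  # the nan is replaced with a value estimated from 'a', 'b', 'c'
```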
- Why did I use a QuantileTransformer on my numerical data?
- "Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian or standard probability distribution.". In particular, I included a LogisticRegression model in my RandomizedSearchCV which assumes that the data is normalized.
- Why did I use cross_val_score?
  - I decided to use cross_val_score because I felt it gave a better sense of how well a model performs, which lets me properly compare it to other modeling strategies.
- Why did I decide to use the StackingClassifier()?
  - Simply put, it had a better cross-validation score than any other model.
Future Steps
- As boosting has taken Kaggle competitions by storm, it would be interesting to explore how boosting could further improve this model. In particular, I am interested to see how XGBoost and CatBoost could be used to solve this problem.
- Further exploration on other feature engineering techniques that could improve predictability.