XGBoost Multiclass Classification of Resume

Availability: The project and all supporting files are also available at https://github.com/davidlevinwork/Resume-Predictor.

Overview

Background - defining the problem, previous attempts, and ideas for improvement

Exploratory data analysis (EDA) and Pre-processing

spaCy model

Model optimization and training

Model predictions

Model evaluation - strengths and weaknesses

Dynamic demonstration of the model

Background

Defining the problem

Job placement organizations are flooded with resumes that are currently processed by hand. Candidates either apply to a specific opening, or a recruiter reviews each resume case by case and subjectively decides where to place the candidate.

We aim to optimize this process with an automated model that performs a preliminary classification of resumes into job categories. In the future, this could allow resumes to be matched with job openings automatically.

The current model could also be used by large, complex companies to quickly review incoming resumes and route them to the relevant departments.

Previous attempts

We identified 2 main approaches for attempting to tackle this dataset:

Supervised: Random-forest (RF) from: https://www.kaggle.com/code/sanchukanirupama/rf-based-multiclass-resume-classifier.

The RF model's F1 scores were 0.84 on the training set and 0.53 on the test set, which suggests that the model overfits.

Unsupervised: Topic Modeling using Latent Dirichlet Allocation (LDA) as a form of clustering from: https://deepnote.com/@abid/spaCy-Resume-Analysis-81ba1e4b-7fa8-45fe-ac7a-0b7bf3da7826

The spaCy model did not try to classify; however, it did have interesting notions regarding text analysis using advanced NLP models. In addition, the spaCy model used only a small portion (200 resumes) of the data set. Its aims are also different: 1) help recruiters go through hundreds of applications within a few minutes, and 2) help them decide whether a candidate should move to the interview stage.

Ideas for improvement

We identified 3 aspects that possibly could allow for significant improvement of previous work:

Database - removing classes that are rare or classes that are not in line with the project's goals.

Pre-processing - improving text recognition by utilizing NLP models (nltk, spaCy).

Model - use a more advanced model (XGBoost).

Preparing and Loading Project Requirements

import re
import nltk
import spacy
import gensim
import joblib
import logging
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from spacy import displacy
from sklearn import metrics
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# One-time NLTK resources used below (stop words, tokenizer, WordNet).
# Newer NLTK releases may also need 'punkt_tab' for word_tokenize.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)

The Data Set

The dataset is a collection of resume examples taken from livecareer.com. It contains 2400+ resumes in both string and PDF format. The PDFs in the data folder are organized into folders named after their category labels.

Inside the CSV:

ID: Unique identifier and file name for the respective pdf.

Resume_str: Contains the resume text only in string format.

Resume_html: Contains the resume data in HTML format, as present during web scraping.

Category: Category of the job the resume was used to apply for.

Acknowledgments: Data were obtained by scraping individual resume examples from the www.livecareer.com website.

Import dataset from GIT repository

url = 'https://raw.githubusercontent.com/davidlevinwork/Resume-Predictor/master/Resume/Resume.csv'
df = pd.read_csv(url)
df.tail()

Exploratory Data Analysis

Job Categories Frequency

plt.figure(figsize=(10, 6))
category_counts = df['Category'].value_counts()
sorted_categories = category_counts.index
sns.countplot(data=df, y='Category', order=sorted_categories, color='steelblue')
plt.title('Job Category Frequency')
plt.xlabel('Number of appearances in the data')
plt.ylabel('Category')
plt.show()

As can be seen, BPO, Automobile, and Agriculture appear significantly less often than the rest of the categories. For this reason, we chose to remove them to achieve a more balanced dataset.

df = df[~df['Category'].isin(['BPO', 'AUTOMOBILE', 'AGRICULTURE'])]
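The three category names above are hard-coded. As a minimal sketch of the same idea driven by a frequency threshold, the plain-Python helper below keeps only categories with at least `min_count` examples. The function name, the threshold, and the toy labels are all assumptions for illustration, not part of the original notebook.

```python
from collections import Counter

def drop_rare_categories(labels, min_count):
    """Return the row positions to keep and the set of surviving categories."""
    counts = Counter(labels)
    keep = {category for category, n in counts.items() if n >= min_count}
    kept_positions = [i for i, category in enumerate(labels) if category in keep]
    return kept_positions, keep

# Toy labels (made up): BPO is rare and gets dropped.
labels = ["HR"] * 5 + ["BPO"] * 2 + ["CHEF"] * 4
kept_positions, kept_categories = drop_rare_categories(labels, min_count=3)
print(sorted(kept_categories), len(kept_positions))
```

With pandas, an equivalent filter could be written as `df[df['Category'].map(df['Category'].value_counts()) >= min_count]`.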

Resume Length Frequencies

df['length_of_string'] = df['Resume_str'].apply(lambda x: len(str(x)))
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='length_of_string')
plt.title('Histogram of Resume Lengths')
plt.xlabel('Length of Resume')
plt.ylabel('Frequency')
plt.show()

We can see that one resume is empty or nearly empty (fewer than 50 characters), so it should be discarded.

indices_to_drop = df[df['length_of_string'] < 50].index
df = df.drop(indices_to_drop, axis=0)

df['length_of_string_post'] = df['Resume_str'].apply(lambda x: len(str(x)))
fig, ax = plt.subplots(figsize=(10, 6))
sns.ecdfplot(data=df, x=df['length_of_string_post'] / 1000, linestyle='-', ax=ax)
ax.xaxis.set_ticks(np.arange(0, df['length_of_string_post'].max() / 1000 + 1, 1))
ax.yaxis.set_ticks(np.arange(0, 1.1, 0.1))
# Reference lines: 20th and 90th percentiles, 5,000 and 9,000 characters.
plt.axhline(0.2, color='grey', linestyle='--')
plt.axhline(0.9, color='grey', linestyle='--')
plt.axvline(5, color='grey', linestyle='--')
plt.axvline(9, color='grey', linestyle='--')
ax.set_xlabel('Length of Resume (x1000)')
ax.set_ylabel('Cumulative Data')
ax.set_title('Empirical Cumulative Distribution Function (ECDF)')
plt.show()

We can still observe that some resumes are lengthy. However, most of the data set (roughly the 20th to the 90th percentile) is between 5,000 and 9,000 characters long.
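To make the ECDF reading concrete, here is a small sketch of how the share of resumes within a length range follows from the ECDF: it is the difference between the ECDF values at the two bounds. The lengths below are made-up numbers, not values from the dataset.

```python
from bisect import bisect_right

def ecdf(values):
    """Return F, where F(x) is the fraction of values less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf

# Hypothetical resume lengths in characters.
lengths = [1200, 4800, 5200, 6100, 6900, 7400, 8100, 8800, 9500, 14000]
F = ecdf(lengths)
share_between = F(9000) - F(5000)
print(f"Share between 5,000 and 9,000 characters: {share_between:.2f}")
```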

And while there is some variability, most categories behave in a similar manner.

Text Preprocessing

Lowercasing > Removing non-alphabetical chars > Tokenization > Stop-word removal > Stemming

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def preprocess_text(txt):
    txt = txt.lower()
    txt = re.sub('[^a-zA-Z]', ' ', txt)                # keep letters only
    txt = word_tokenize(txt)
    txt = [w for w in txt if w not in stop_words]      # drop stop words
    txt = [stemmer.stem(w) for w in txt]               # reduce words to stems
    return ' '.join(txt)
df['Resume'] = df['Resume_str'].apply(preprocess_text)

Word frequencies by job category

10 most common words in each category - We can observe some words that are frequent in most categories, like "manage", and some that are unique to a single category.
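The code behind this view is not shown here. A minimal sketch of how the most common words per category could be computed is given below; the function name and the toy inputs are assumptions, and the texts are expected to be already preprocessed, space-separated strings.

```python
from collections import Counter, defaultdict

def top_words_by_category(texts, categories, n=10):
    """Count whitespace-separated tokens per category and return the n most common."""
    counters = defaultdict(Counter)
    for text, category in zip(texts, categories):
        counters[category].update(text.split())
    return {
        category: [word for word, _ in counter.most_common(n)]
        for category, counter in counters.items()
    }

# Toy, already-stemmed texts (made up for illustration).
texts = ["manag team hire", "manag recruit hire hire", "cook manag kitchen kitchen"]
categories = ["HR", "HR", "CHEF"]
top_words = top_words_by_category(texts, categories, n=2)
print(top_words)
```

On the real data this would be called as `top_words_by_category(df['Resume'], df['Category'])`.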

spaCy Model

The jobzilla skill dataset is a jsonl file containing different skills. The data set contains labels and patterns: words that are used to describe skills.
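For orientation, spaCy's entity ruler reads one JSON object per line, each with a `label` and a `pattern` (either a plain string or a list of token descriptions). The sketch below parses two illustrative lines in that format; the two sample entries are invented for this example and are not copied from the jobzilla file.

```python
import json

# Illustrative lines in the entity-ruler JSONL format (not taken from the real file).
sample_jsonl = """{"label": "SKILL", "pattern": [{"LOWER": "python"}]}
{"label": "SKILL", "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]}
"""

def parse_patterns(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

patterns = parse_patterns(sample_jsonl)
# Rebuild a readable phrase from each token-based SKILL pattern.
skill_phrases = [
    " ".join(token["LOWER"] for token in entry["pattern"])
    for entry in patterns
    if entry["label"] == "SKILL"
]
print(skill_phrases)
```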

import requests
nlp = spacy.load('en_core_web_sm')
ruler = nlp.add_pipe("entity_ruler")
# URL to the JSONL file
url = "https://raw.githubusercontent.com/davidlevinwork/Resume-Predictor/master/jz_skill_patterns.jsonl"
response = requests.get(url)
with open('jz_skill_patterns.jsonl', 'w') as out_file:
    out_file.write(response.text)
ruler.from_disk('jz_skill_patterns.jsonl')

Resume text preprocess (again, as spaCy requires)

Removing non-alphanumeric chars > Lowercasing > Stop-word removal > Lemmatization

lemmatizer = WordNetLemmatizer()
english_stop_words = set(stopwords.words("english"))  # built once, not once per word
# Matches @mentions, non-alphanumeric characters, URLs, and a leading "rt".
noise_pattern = re.compile(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"')
clean = []
for i in range(df.shape[0]):
    review = noise_pattern.sub(" ", df["Resume_str"].iloc[i])
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in english_stop_words]
    clean.append(" ".join(review))
df["Clean_Resume"] = clean

Extract skills using jobzilla NLP model

def get_skills(text):
    doc = nlp(text)
    subset = []
    for ent in doc.ents:
        if ent.label_ == "SKILL":
            subset.append(ent.text)
    return subset
def unique_skills(x):
    return list(set(x))
df["skills"] = df["Clean_Resume"].str.lower().apply(get_skills)
df["skills"] = df["skills"].apply(unique_skills)

Skills frequencies (all categories)

Skills WordCloud (Category='HR')

spaCy NLP model demonstration

Dependency parsing and visualization using spaCy

XGBoost Model Building

Compiling training and test set

Parsing the two columns that will be used as X

label_encoder = LabelEncoder()
# Join list-valued cells into plain strings so the columns can be concatenated.
df['Clean_Resume'] = df['Clean_Resume'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
df['skills'] = df['skills'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
# A space separator keeps the last resume word from fusing with the first skill.
X = df['Clean_Resume'] + " " + df['skills']
# Other feature combinations that were tried:
#X=df['clean']+""+df['skills']
#X=df['clean']
#X=df['skills']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, df['Category'], test_size=0.2, stratify=df['Category'], random_state=23
)
Y_train = label_encoder.fit_transform(Y_train)
Y_test = label_encoder.transform(Y_test)

Note: we tried several feature combinations as X (see the commented-out lines). Concatenating the cleaned resume text with the extracted skills gave the best results.

Vectorizing the training and test sets

vectorizer = TfidfVectorizer()
# Fit the vocabulary and IDF weights on the training set only, then reuse them for the test set.
train_vectorizer = vectorizer.fit_transform(X_train).astype(float)
test_vectorizer = vectorizer.transform(X_test).astype(float)
train_vectorizer, test_vectorizer
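To illustrate what the vectorizer computes, the sketch below reproduces the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is a simplification that splits on whitespace instead of using the library's tokenizer, and the two toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(["manag team hire", "manag kitchen cook"])
# "manag" occurs in both documents, so it is weighted lower than "team".
print(rows[0])
```

Terms shared by every resume, such as "manag" here, end up with the smallest weights, which is why TF-IDF tends to emphasize category-specific vocabulary.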

Grid search

Defining hyper-parameters for GridSearch

xgb_param_grid = {
    # 100 evenly spaced values from 100 to 1500
    'n_estimators': [int(x) for x in np.linspace(start=100, stop=1500, num=100)],
    # num=2 yields only the two endpoints: [3, 21]
    'max_depth': [int(x) for x in np.linspace(start=3, stop=21, num=2)],
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0],
    'gamma': [0, 0.1, 0.2]
}
model = xgb.XGBClassifier()

Grid search scoring is based on accuracy score

param_grid = xgb_param_grid
grid = GridSearchCV(
    cv=3,
    verbose=0,
    scoring='accuracy',
    estimator=model,
    param_grid=param_grid,
    return_train_score=False
)
grid_search = grid.fit(train_vectorizer, Y_train)
model = grid_search.best_estimator_

Running this grid search takes a long time, so the next code segment hard-codes the resulting parameter values.
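A rough way to see why the search is slow is to count the model fits it requires: an exhaustive grid search fits one model per parameter combination per cross-validation fold. The sketch below mirrors the grid defined above in plain Python; `linspace_int` is a hypothetical stand-in for the `np.linspace` plus `int` conversion and is only meant to reproduce the number of values and the endpoints.

```python
from math import prod

def linspace_int(start, stop, num):
    """Evenly spaced values from start to stop (inclusive), truncated to int."""
    if num == 1:
        return [int(start)]
    return [int(start + (stop - start) * i / (num - 1)) for i in range(num)]

param_grid = {
    "n_estimators": linspace_int(100, 1500, 100),
    "max_depth": linspace_int(3, 21, 2),
    "learning_rate": [0.001, 0.01, 0.1, 0.2],
    "subsample": [0.5, 0.7, 1.0],
    "colsample_bytree": [0.5, 0.7, 1.0],
    "gamma": [0, 0.1, 0.2],
}
cv_folds = 3

n_combinations = prod(len(values) for values in param_grid.values())
total_fits = n_combinations * cv_folds
print(n_combinations, total_fits)  # 21600 64800
```

Under these assumptions the search covers 21,600 combinations, or 64,800 fits with 3-fold cross-validation. Note also that `max_depth` takes only the values 3 and 21 in this grid, and neither 9 nor an `n_estimators` of exactly 500 appears in it, so the hard-coded parameters in the training step presumably came from a different or later-adjusted grid.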

Training the model

model = xgb.XGBClassifier(
    colsample_bytree=1.0,
    learning_rate=0.1,
    max_depth=9,
    n_estimators=500,
    subsample=0.7
)
model.fit(train_vectorizer, Y_train)

Model predictions

predictions=model.predict(test_vectorizer)

predictions_proba = model.predict_proba(test_vectorizer)
classes = label_encoder.inverse_transform(range(len(model.classes_)))
df_predictions = pd.DataFrame(predictions_proba, columns=classes, index=X_test.index)

print("Training Score: {:.2f}".format(model.score(train_vectorizer, Y_train)))
print("Test Score: {:.2f}".format(model.score(test_vectorizer, Y_test)))

Training Score: 1.00
Test Score: 0.78

Accuracy by category
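The code for this view is not shown here. As a minimal sketch, per-category accuracy is taken below to mean the share of resumes of a given true category that the model labels correctly (equivalent to per-class recall); that reading is an assumption, and the labels are toy values rather than model output.

```python
from collections import Counter

def accuracy_by_category(y_true, y_pred):
    """Fraction of correctly predicted examples for each true category."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: correct[category] / total[category] for category in total}

def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs; the raw material of a confusion matrix."""
    return Counter(zip(y_true, y_pred))

# Toy labels (made up for illustration).
y_true = ["ARTS", "ARTS", "ARTS", "CHEF", "CHEF"]
y_pred = ["ARTS", "TEACHER", "DESIGNER", "CHEF", "CHEF"]
per_category = accuracy_by_category(y_true, y_pred)
pairs = confusion_counts(y_true, y_pred)
print(per_category)
```

With the notebook's variables, the inputs would be `label_encoder.inverse_transform(Y_test)` and `label_encoder.inverse_transform(predictions)`.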

Confusion matrix

ROC

Micro and macro ROC

ROC per Category

Precision-Recall curve

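from sklearn.metrics import classification_report
# Map encoded labels back to category names; true labels go first, predictions second.
true_labels = label_encoder.inverse_transform(Y_test)
predicted_labels = label_encoder.inverse_transform(predictions)
report = classification_report(true_labels, predicted_labels)
print(report)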
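The report lists precision, recall, and F1 per category, along with macro and weighted averages. As a sketch of how those averages relate to the per-class values, the plain-Python functions below compute them from scratch; the function names and the tiny label lists are illustrative assumptions.

```python
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Precision, recall, F1 and support for every label seen in either list."""
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual[label]}
    return scores

def macro_and_weighted_f1(scores):
    """Macro: plain mean over classes. Weighted: mean weighted by class support."""
    macro = sum(s["f1"] for s in scores.values()) / len(scores)
    total = sum(s["support"] for s in scores.values())
    weighted = sum(s["f1"] * s["support"] for s in scores.values()) / total
    return macro, weighted

scores = per_class_scores(["A", "A", "A", "B"], ["A", "A", "B", "B"])
macro_f1, weighted_f1 = macro_and_weighted_f1(scores)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

The macro average treats every category equally, so a weak small class pulls it down more than it pulls down the weighted average.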

Averaged

Per category:

Model evaluation - strengths and weaknesses

In our exploration of the predictive model we've developed, we've discovered some fascinating strengths and areas for improvement. Our model was tasked with predicting a range of professions based on given data, and the results were quite enlightening.

Strengths

Our model showcased impressive accuracy in predicting certain professions. The standout was the "CONSTRUCTION" profession, where our model achieved perfect accuracy. This suggests that our model is adept at identifying unique features or patterns that are characteristic of the construction industry.

Other professions where the model performed exceptionally well include "CHEF", "HR", and "TEACHER", with accuracies of 0.958, 0.954, and 0.95, respectively. This high level of accuracy across diverse professions indicates the versatility of our model.

Areas for Improvement

Despite the model's strengths, there were some professions where the model's performance was less than optimal. The "ARTS" profession was the most challenging for our model, with an accuracy of just 0.333. This could be due to a variety of factors, such as a lack of distinctive features or insufficient training data for this class.

Other professions where the model could improve include "APPAREL", "DIGITAL-MEDIA", "BANKING", "CONSULTANT", "FINANCE", "HEALTHCARE", and "SALES". These areas indicate where our model might be struggling to distinguish between overlapping features of different professions.

Insights and Next Steps

Our model's high accuracy in predicting the "CONSTRUCTION" profession suggests that there are distinctive keywords or patterns in the data related to this profession that our model has successfully learned to identify.

On the other hand, the low accuracy for the "ARTS" profession suggests that we may need to revisit our approach for this class. This could involve gathering more training data, refining our features, or exploring different model architectures.

While our model has demonstrated promising results, these insights highlight the complexity of the task and the ongoing refinement required to improve its performance. We're excited about the progress we've made and look forward to continuing to enhance our model's ability to predict a wide range of professions accurately.

Our HR Application Based On Our Model

We've developed an application tailored to HR professionals and job seekers alike, designed to apply machine learning to resume analysis. Our app not only processes resumes to identify and highlight key skills, but also visualizes these skills in a sunburst chart that shows their distribution and variety. The most exciting feature, perhaps, is the app's ability to predict the most suitable profession for the candidate based on the skills extracted from their resume. This is achieved with the XGBoost model trained on the resume dataset described above. For HR professionals, the app makes identifying candidate suitability quicker and more consistent. For job seekers, it offers feedback on their resume, indicating their strongest skill areas and suggesting suitable career paths. This application is our way of connecting current NLP and machine learning tools with the everyday needs of HR departments and job applicants.

Upload Page

This is the landing page of our application, where users are prompted to upload their resume in a .pdf or .txt format. The system has been designed to accept files of the most common text formats. After uploading the file, the user clicks on the "Upload" button. Once the file has been successfully uploaded, the user can proceed to the next stage, "Highlight".

This screenshot displays the Upload page. Note the button that allows users to select and upload their resumes.

Highlight Page

After uploading a resume, the user is taken to the "Highlight" page. Here, the resume is processed using NLP (Natural Language Processing) techniques, and the most relevant skills and keywords are highlighted. The highlights are based on the information in the job skill ontology, which contains a broad set of skills that employers may look for. The user can see their original resume with the important skills and words emphasized, giving them an insight into what stands out in their resume.

This screenshot shows the Highlight page.

Visualizations

The Visualization page is where the user can see a graphical representation of the skills extracted from their resume. A sunburst chart is used to show the distribution of the skills. This not only makes it easy to understand the proportion of each type of skill the user has but also gives a quick snapshot of areas the user might need to develop further.

Prediction Page

Finally, the user is taken to the Prediction page. This is where our application uses an XGBoost model to predict the profession most suitable for the user based on the skills extracted from their resume. The predicted profession is then displayed to the user. This can give the user an idea of which job profiles their resume is best suited for, helping them to target their job search more effectively.
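The application code is not part of this write-up. A minimal sketch of what the prediction step could look like is shown below, assuming the app reuses the fitted preprocessing function, TF-IDF vectorizer, XGBoost model, and label encoder from the training notebook. The function name `predict_top_categories` and its parameters are hypothetical.

```python
def predict_top_categories(resume_text, preprocess, vectorizer, model, label_encoder, k=3):
    """Return the k most probable categories as (name, probability) pairs.

    Assumes column i of predict_proba corresponds to encoded label i,
    which holds when labels were produced by a LabelEncoder (0..n-1).
    """
    features = vectorizer.transform([preprocess(resume_text)])
    probabilities = model.predict_proba(features)[0]
    # Indices of the k largest probabilities, highest first.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)[:k]
    names = label_encoder.inverse_transform(ranked)
    return [(str(name), float(probabilities[i])) for name, i in zip(names, ranked)]
```

Returning the top few categories with their probabilities, instead of a single label, would let the page show how confident the model is, which matters for categories where accuracy is low.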
