XGBoost Multiclass Classification of Resumes
Availability: The project and all supporting files are also available at https://github.com/davidlevinwork/Resume-Predictor.
Overview
Background
Defining the problem
Job placement organizations are flooded with resumes that are currently processed by hand. Candidates either apply to a specific opening, or a recruiter reviews each resume case by case and subjectively decides where to place the candidate.
We aim to streamline this process with an automated model that performs a preliminary classification of resumes into job categories, a first step toward automatically matching resumes with job openings in the future.
The current model could also be used by large, complex companies to quickly review incoming resumes and route them to the relevant departments.
Previous attempts
We identified two main previous approaches to this dataset: an RF model and a spaCy model.
The RF model achieved F1 scores of 0.84 on the training set and 0.53 on the test set, which suggests that the model overfits.
The spaCy model did not attempt classification; however, it offered interesting ideas about text analysis using advanced NLP models. In addition, the spaCy model used only a small portion (200 resumes) of the dataset. Its aims were different: 1) help recruiters go through hundreds of applications within a few minutes, and 2) help them decide whether to move a candidate to the interview stage.
Ideas for improvement
We identified three aspects that could allow significant improvement over previous work:
Preparing and Loading Project Requirements
The Data Set
The dataset contains a collection of resume examples taken from livecareer.com. It includes 2400+ resumes, both as strings and in PDF format. The PDFs stored in the data folder are organized into folders named after their respective labels.
Inside the CSV:
Acknowledgments: Data were obtained by scraping individual resume examples from the www.livecareer.com website.
Import dataset from Git repository
Exploratory Data Analysis
Job Categories Frequency
As can be seen, BPO, Automobile, and Agriculture appear significantly less often than the other categories. For this reason, we removed them to achieve a more balanced dataset.
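A minimal sketch of this kind of filtering, assuming the resumes sit in a pandas DataFrame with a `Category` column. The toy data, the `Resume_str` column name, the helper `drop_rare_categories`, and the `min_count` threshold are all illustrative assumptions, not the project's actual values.

```python
import pandas as pd

# Toy stand-in for the resume table; the real data has 2400+ rows.
df = pd.DataFrame({
    "Category": ["HR"] * 5 + ["CHEF"] * 4 + ["BPO"] * 1,
    "Resume_str": ["resume text"] * 10,  # assumed column name
})

def drop_rare_categories(frame, min_count):
    """Keep only rows whose category occurs at least `min_count` times."""
    counts = frame["Category"].value_counts()
    keep = counts[counts >= min_count].index
    return frame[frame["Category"].isin(keep)].reset_index(drop=True)

balanced = drop_rare_categories(df, min_count=3)  # hypothetical threshold
print(balanced["Category"].value_counts())
```

Filtering by a count threshold rather than by hard-coded names keeps the step reusable if the dataset is refreshed, although naming the three categories explicitly would work just as well here.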
Resume Length Frequencies
We can see that one resume has zero or close to zero words in it, and it should be discarded.
We can still observe that some resumes are lengthy. However, most of the data set is between 5,000 and 9,000 words.
And while there is some variability, most categories behave in a similar manner.
Text Preprocessing
Lowercasing > Tokenization > Removing non-alphabetical chars > Stop-words removal > Stemming
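The pipeline above can be sketched roughly as follows. This is an illustration under assumptions rather than the project's code: tokenization is a plain whitespace split, the stop-word list is a tiny hand-written set, and stemming uses NLTK's `PorterStemmer` (the source does not say which stemmer was used).

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stop-word list (assumption); a real run would use a
# fuller list such as NLTK's English stop words.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "on"}

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase > tokenize > strip non-alphabetic chars > drop stop words > stem."""
    tokens = text.lower().split()                         # whitespace tokenization
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]   # keep letters only
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Managed 12 projects for the Finance team in 2019."))
```

Purely numeric tokens such as "12" become empty after the character filter and are dropped together with the stop words.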
Word frequencies by job category
The 10 most common words in each category. We can observe that some words, such as "manage", are frequent in most categories, while others are unique to a single category.
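One way to compute such per-category word counts is with `collections.Counter`. The sketch below uses toy, already-preprocessed token lists; the function name `top_words_by_category` is hypothetical.

```python
from collections import Counter, defaultdict

def top_words_by_category(records, n=10):
    """Return {category: [(word, count), ...]} with the n most common words."""
    counters = defaultdict(Counter)
    for category, tokens in records:
        counters[category].update(tokens)
    return {cat: counter.most_common(n) for cat, counter in counters.items()}

# Toy, already-preprocessed resumes (illustrative data only).
records = [
    ("HR", ["manag", "recruit", "employe", "manag"]),
    ("HR", ["recruit", "polici", "manag"]),
    ("CHEF", ["kitchen", "menu", "manag", "kitchen"]),
]

top = top_words_by_category(records, n=2)
print(top)
```

In this toy input, "manag" appears in both categories while "kitchen" appears only under CHEF, mirroring the shared-versus-unique pattern noted above.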
spaCy Model
The jobzilla skill dataset is a JSONL file containing different skills. Each entry holds a label and a pattern: the words that are used to describe a skill.
Resume text preprocessing (repeated, as spaCy requires)
Removing non-alphabetical chars > Lowercasing > Lemmatization > Stop-words removal
Extract skills using jobzilla NLP model
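A label-and-pattern file of this kind is typically consumed by spaCy's `entity_ruler` component. The sketch below is an assumption-laden illustration: it uses a blank English pipeline (tokenizer only, no model download), two hand-written patterns in place of the real JSONL file, and a `SKILL` label that may differ from the labels in the actual dataset.

```python
import spacy

# Two hand-written patterns in a label/pattern shape (illustrative only;
# the real file holds many more entries and may use different labels).
patterns = [
    {"label": "SKILL", "pattern": [{"LOWER": "python"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]},
]

nlp = spacy.blank("en")               # tokenizer only, no trained model needed
ruler = nlp.add_pipe("entity_ruler")  # rule-based entity matcher
ruler.add_patterns(patterns)

def extract_skills(text):
    """Return the sorted set of skill strings matched in `text`."""
    doc = nlp(text)
    return sorted({ent.text.lower() for ent in doc.ents if ent.label_ == "SKILL"})

print(extract_skills("Experienced in Python and machine learning pipelines."))
```

With a real pattern file, the patterns would be loaded from disk rather than written inline; the matching logic stays the same.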
Skills frequencies (all categories)
Skills WordCloud (Category='HR')
spaCy NLP model demonstration
Dependency parsing and visualization using spaCy
XGBoost Model Building
Compiling training and test set
Parsing the two columns that will be used as X
Note: We tried several feature combinations as 'X'. This one gave the best results:
Vectorizing the training and test sets
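The source does not state which vectorizer was used, so the following sketch assumes a TF-IDF representation via scikit-learn's `TfidfVectorizer`. The point it illustrates is general: the vocabulary is fitted on the training texts only and then reused, unchanged, to transform the test texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy preprocessed texts (illustrative data only).
train_texts = ["manag recruit employe", "kitchen menu cook", "recruit polici hire"]
test_texts = ["cook menu banquet"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF on train only
X_test = vectorizer.transform(test_texts)        # reuse the same vocabulary

print(X_train.shape, X_test.shape)
```

Words that occur only in the test set ("banquet" here) are ignored, which keeps both matrices in the same feature space and avoids leaking test information into training.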
Grid search
Defining hyper-parameters for GridSearch
Grid search scoring is based on accuracy score
Executing the full grid search takes a long time; the code segment that follows it reports the resulting values.
Training the model
Model predictions
Training Score: 1.00 Test Score: 0.78
Accuracy by category
Confusion matrix
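Both views can be derived from the same predictions. The sketch below assumes that "accuracy by category" means the fraction of each category's resumes that were predicted correctly (that is, per-class recall), computed as the diagonal of the confusion matrix divided by its row sums. The labels and predictions are toy values, not results from this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["A", "B", "C"]  # placeholder category names
y_true = ["A", "A", "B", "B", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "C", "C", "B", "A"]

# Rows are true categories, columns are predicted categories.
cm = confusion_matrix(y_true, y_pred, labels=labels)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)

for name, acc in zip(labels, per_class_accuracy):
    print(f"{name}: {acc:.3f}")
```

Off-diagonal cells show which categories get confused with one another, which the single per-category number cannot reveal.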
ROC
Micro and macro ROC
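For a multiclass model, ROC analysis is usually done one-vs-rest: micro-averaging pools every (sample, class) decision into a single curve, while macro-averaging computes one AUC per class and takes their unweighted mean. A minimal sketch with scikit-learn follows, using toy probabilities rather than this model's outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = [0, 1, 2]
y_true = np.array([0, 0, 1, 1, 2, 2])
# Toy predicted probabilities, one row per sample (each row sums to 1).
y_score = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
])

y_bin = label_binarize(y_true, classes=classes)  # one-vs-rest indicator matrix

micro_auc = roc_auc_score(y_bin, y_score, average="micro")  # pool all decisions
macro_auc = roc_auc_score(y_bin, y_score, average="macro")  # mean of per-class AUCs
print(f"micro AUC = {micro_auc:.3f}, macro AUC = {macro_auc:.3f}")
```

Macro-averaging weights every category equally, so it is more sensitive to weak performance on small categories than micro-averaging is.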
ROC per Category
Precision-Recall curve
Averaged
Per category:
Model evaluation - strengths and weaknesses
Evaluating the model category by category reveals clear strengths as well as areas for improvement. The model was tasked with predicting a job category for each resume, and the per-category results are informative.
Strengths
Our model showcased impressive accuracy in predicting certain professions. The standout was the "CONSTRUCTION" profession, where our model achieved perfect accuracy. This suggests that our model is adept at identifying unique features or patterns that are characteristic of the construction industry.
Other professions where the model performed exceptionally well include "CHEF", "HR", and "TEACHER", with accuracies of 0.958, 0.954, and 0.95, respectively. This high level of accuracy across diverse professions indicates the versatility of our model.
Areas for Improvement
Despite the model's strengths, there were some professions where the model's performance was less than optimal. The "ARTS" profession was the most challenging for our model, with an accuracy of just 0.333. This could be due to a variety of factors, such as a lack of distinctive features or insufficient training data for this class.
Other professions where the model could improve include "APPAREL", "DIGITAL-MEDIA", "BANKING", "CONSULTANT", "FINANCE", "HEALTHCARE", and "SALES". These areas indicate where our model might be struggling to distinguish between overlapping features of different professions.
Insights and Next Steps
Our model's high accuracy in predicting the "CONSTRUCTION" profession suggests that there are distinctive keywords or patterns in the data related to this profession that our model has successfully learned to identify.
On the other hand, the low accuracy for the "ARTS" profession suggests that we may need to revisit our approach for this class. This could involve gathering more training data, refining our features, or exploring different model architectures.
While our model has demonstrated promising results, these insights highlight the complexity of the task and the ongoing refinement required to improve its performance. We're excited about the progress we've made and look forward to continuing to enhance our model's ability to predict a wide range of professions accurately.
Our HR Application Based On Our Model
We've developed an application for HR professionals and job seekers alike that applies machine learning to resume analysis. The app processes resumes to identify and highlight key skills, and it visualizes these skills in a sunburst chart that shows their distribution and variety. Its most notable feature is the ability to predict the most suitable profession for a candidate based on the skills extracted from their resume, using an XGBoost model trained on the resume dataset described above.
For HR professionals, the app makes identifying candidate suitability quicker and more precise. For job seekers, it offers feedback on their resume, indicating their strongest skill areas and suggesting suitable career paths. The application is our way of connecting current AI technology with the everyday needs of HR departments and job applicants.
Upload Page
This is the landing page of our application, where users are prompted to upload their resume in .pdf or .txt format. After selecting the file, the user clicks the "Upload" button. Once the file has been uploaded successfully, the user can proceed to the next stage, "Highlight".
This screenshot displays the Upload page. Note the button that allows users to select and upload their resumes.
Highlight Page
After uploading a resume, the user is taken to the "Highlight" page. Here, the resume is processed using NLP (Natural Language Processing) techniques, and the most relevant skills and keywords are highlighted. The highlights are based on the information in the job skill ontology, which contains a broad set of skills that employers may look for. The user can see their original resume with the important skills and words emphasized, giving them an insight into what stands out in their resume.
This screenshot shows the Highlight page.
Visualizations
The Visualization page is where the user can see a graphical representation of the skills extracted from their resume. A sunburst chart is used to show the distribution of the skills. This not only makes it easy to understand the proportion of each type of skill the user has but also gives a quick snapshot of areas the user might need to develop further.
Prediction Page
Finally, the user is taken to the Prediction page. This is where our application uses an XGBoost model to predict the profession most suitable for the user based on the skills extracted from their resume. The predicted profession is then displayed to the user. This can give the user an idea of which job profiles their resume is best suited for, helping them to target their job search more effectively.