AutoInland Vehicle Insurance Claim Using OverSampling and VotingClassifier
This is a simple starter notebook for the AutoInland Vehicle Insurance Claim Challenge on Zindi.
This notebook covers:
- Loading the data
- Simple EDA and an example of feature engineering
- Data preprocessing and data wrangling
- Over Sampling
- Voting Classifier
- Making a submission
- Some tips for improving your score
This notebook is from the Zindi competition AutoInland Vehicle Insurance Claim Challenge.
Importing libraries
Read files
Some basic EDA
Combine train and test set for easy preprocessing
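As a rough sketch of what combining the two sets can look like: the tiny frames below are made-up stand-ins for the competition files, and the column names (`ID`, `Gender`, `target`) are assumptions, so check them against the real data.

```python
import pandas as pd

# Toy stand-ins for the train and test files (hypothetical values).
train = pd.DataFrame({"ID": ["a", "b", "c"],
                      "Gender": ["Male", None, "Female"],
                      "target": [0, 1, 0]})
test = pd.DataFrame({"ID": ["d", "e"], "Gender": ["Female", "Male"]})

# Stack both sets so each preprocessing step is written only once.
n_train = len(train)
all_data = pd.concat([train, test], ignore_index=True)

# ... preprocessing on all_data would go here ...

# Split back by position; test rows never had a target, so drop that column.
train_clean = all_data.iloc[:n_train].copy()
test_clean = all_data.iloc[n_train:].drop(columns="target").reset_index(drop=True)
```

One thing to keep in mind: statistics computed on the combined frame (means, category frequencies) also see the test rows, which is convenient but can make local validation slightly optimistic.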
Distribution of the target variable
Distribution of the Gender column
Number of unique values per categorical column
Filling in missing values
Missing values can be filled using different strategies
Tips:
- Mean
- Max
- Min
- sklearn SimpleImputer
- Others... do more research
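A minimal sketch of two of these strategies, using toy data: the `Age` and `Gender` columns and their values are assumptions for illustration only. The median is used here instead of the mean because it is less sensitive to outliers, and an explicit "Unknown" label is one possible choice for a categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with gaps (hypothetical values).
df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Gender": ["Male", None, "Female", "Male"],
})

# Numeric column: fill missing entries with the column median.
num_imputer = SimpleImputer(strategy="median")
df["Age"] = num_imputer.fit_transform(df[["Age"]]).ravel()

# Categorical column: keep "missing" as its own category.
df["Gender"] = df["Gender"].fillna("Unknown")
```

Whether "missing" carries signal of its own is worth checking during EDA; if it does, a separate category like this may work better than filling with the most frequent value.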
Feature Engineering
A lot of features can be extracted from dates.
Tips:
- Quarter, Start of Year, month?
- Is it a weekend, weekday?
- Is it a holiday?
- Duration between different periods, e.g. start and end of a policy
- What features can be derived from the age column?
- Be creative 😉
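A few of the date ideas above could be sketched as follows. The column names `Policy Start Date` and `Policy End Date` and the sample dates are assumptions; adjust them to whatever the dataset actually contains.

```python
import pandas as pd

# Toy policy dates (hypothetical values).
df = pd.DataFrame({
    "Policy Start Date": ["2010-05-15", "2010-11-29"],
    "Policy End Date": ["2011-05-14", "2011-02-28"],
})

# Parse strings to datetimes; unparseable entries become NaT instead of raising.
start = pd.to_datetime(df["Policy Start Date"], errors="coerce")
end = pd.to_datetime(df["Policy End Date"], errors="coerce")

df["start_month"] = start.dt.month
df["start_quarter"] = start.dt.quarter
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
df["start_is_weekend"] = (start.dt.dayofweek >= 5).astype(int)
# Policy duration in whole days.
df["policy_days"] = (end - start).dt.days
```

Holiday flags would need a calendar for the relevant country, which is not shown here.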
Try different strategies of dealing with categorical variables
Tips:
- One hot encoding
- Label encoding
- Target encoding
- Reduce the number of unique values...
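The sketch below illustrates three of these options on a toy column: grouping rare categories, label encoding, and one-hot encoding. The column name `Car_Category`, its values, and the `min_count` threshold are all assumptions for illustration.

```python
import pandas as pd

# Toy categorical column (hypothetical values).
df = pd.DataFrame({
    "Car_Category": ["Saloon", "JEEP", "Saloon", "Truck", "Saloon", "Bus"],
})

# Reduce the number of unique values: categories seen fewer than
# min_count times are merged into a single "Other" bucket.
min_count = 2
counts = df["Car_Category"].value_counts()
rare = counts[counts < min_count].index
df["Car_Category_grouped"] = df["Car_Category"].where(
    ~df["Car_Category"].isin(rare), "Other"
)

# Label encoding: integer codes assigned in sorted category order.
df["Car_Category_code"] = df["Car_Category_grouped"].astype("category").cat.codes

# One-hot encoding: one indicator column per remaining category.
one_hot = pd.get_dummies(df["Car_Category_grouped"], prefix="Car_Category")
```

Target encoding is left out on purpose: done naively it leaks the target into the features, so it is usually computed out-of-fold.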
Training and making predictions
Tips:
- Is LightGBM (lgbm) the best model for this challenge?
- Parameter tuning
- Grid search, random search, perhaps Bayesian search works better...
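As one possible starting point for parameter tuning, here is a small random search with scikit-learn. A random forest stands in for the model, the data is synthetic, the parameter ranges are arbitrary, and F1 is assumed as the scoring metric; swap in your own estimator, grid, and the metric the competition actually uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=0)

# Candidate values to sample from (arbitrary, for illustration).
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                      # number of sampled parameter settings
    scoring="f1",                  # assumed metric
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Random search usually covers a wide space more cheaply than a full grid; Bayesian search needs an extra library and is not shown.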
Over Sampling
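A minimal sketch of random oversampling, where minority-class rows are duplicated until both classes are the same size. It uses `sklearn.utils.resample` on toy data with an assumed `target` column; this is not necessarily the method used in this notebook, and libraries such as imbalanced-learn offer alternatives like `RandomOverSampler` and `SMOTE`.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced training set: 8 negatives, 2 positives (hypothetical).
train = pd.DataFrame({"feature": range(10), "target": [0] * 8 + [1] * 2})

majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Sample minority rows with replacement until they match the majority count.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle so the classes are not grouped together.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```

Oversample only the data used for fitting. Validation and test rows should keep their original class balance, otherwise the scores are misleading.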
Voting Classifier
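The idea of a voting classifier can be sketched with scikit-learn's `VotingClassifier`. The three base models, their settings, and the synthetic data below are assumptions chosen for illustration; any set of classifiers that expose `predict_proba` can be used for soft voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the prepared features.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the base models.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
voter.fit(X_train, y_train)
val_pred = voter.predict(X_val)
print("validation F1:", round(f1_score(y_val, val_pred), 3))
```

Ensembles tend to help most when the base models make different kinds of errors, so mixing model families is usually more useful than combining several near-identical ones.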
Making predictions of the test set and creating a submission file
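Writing the submission file might look like the sketch below. The IDs and predictions are placeholders, and the column names `ID` and `target` are assumptions; match them to the sample submission file provided by the competition.

```python
import numpy as np
import pandas as pd

# Placeholder test IDs and model predictions (hypothetical values).
test_ids = pd.Series(["ID_A", "ID_B", "ID_C"], name="ID")
test_pred = np.array([0, 1, 0])

# One row per test ID, in the same order as the predictions.
submission = pd.DataFrame({"ID": test_ids, "target": test_pred})

# index=False keeps the row index out of the file.
submission.to_csv("submission.csv", index=False)
```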
This submission should easily get you into the top 20. If you want to win, follow the tips below.
More Tips
- Thorough EDA and domain knowledge sourcing
- Regroup categorical features
- More Feature Engineering
- Ensembling of models
- Cross-validation: Group folds, Stratified...
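To make the cross-validation tip concrete, here is a small stratified k-fold loop. The data is synthetic, logistic regression is only a stand-in model, and F1 is assumed as the metric; the point is that each fold keeps roughly the same class ratio as the full set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(n_samples=300, n_features=6,
                           weights=[0.85, 0.15], random_state=1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_idx, val_idx in skf.split(X, y):
    # Fit a fresh model on each training fold.
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X[train_idx], y[train_idx])
    # Score on the held-out fold only.
    scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print("mean F1:", round(float(np.mean(scores)), 3),
      "std:", round(float(np.std(scores)), 3))
```

The spread across folds is as informative as the mean: a large standard deviation suggests the leaderboard score may move a lot between public and private splits.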