Pump it Up: Data Mining the Water Table
HOSTED BY DRIVENDATA
Can you predict which water pumps are faulty?
Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? This is an intermediate-level practice competition. Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.
Import Required Modules
Below I collect the tools that I will use to build the model. After initial exploratory modeling I found that `XGBClassifier` provided the best performance, measured in terms of model accuracy.
Set Random State
There are a few steps below where random processes require a seed. For reproducibility, I set a default random state below.
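A minimal sketch of that convention (the value 42 is my placeholder, not necessarily the seed used in the original run):

```python
# One seed shared by every random process below: the train/test split,
# the XGBoost booster, and the cross-validation shuffling.
RANDOM_STATE = 42
```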
Select Columns to Drop from the Model
Provide a list of variables that should be dropped from the model. I have not observed any improvement in model performance from dropping data, measured in terms of accuracy. Training time is obviously improved by dropping columns but there seems to be a small price to pay in terms of accuracy for reducing the number of available features.
Because of the submission format requirements for the competition, it is vital that I retain the index column throughout modeling so that I am able to produce predictions that can be validated using the competition's validation data.
Make Test Train Split
For the purposes of model tuning I hold 10% of the data out for local testing.
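As a sketch, with toy stand-ins for the competition's features and labels (the variable names `X` and `y` are my own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins; in the notebook these come from the provided training CSVs.
X = pd.DataFrame({"gps_height": [0, 1385, 686, 263, 0, 1000],
                  "quantity": ["enough", "dry", "enough", "dry", "enough", "dry"]})
y = pd.Series(["functional", "non functional", "functional",
               "functional needs repair", "functional", "non functional"])

# Hold out 10% of the rows for local testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
```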
I experimented with both manual and automated feature selection; however, neither approach improved model performance. Initially, I had issues with mixed data types in a couple of columns, including the `permit` column. The function below converts all categorical variables to strings to eliminate those errors.
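A sketch of such a converter (the function name is my own):

```python
import pandas as pd

def stringify_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every object-typed column to str so mixed types
    (e.g. booleans and strings in the same column) don't break encoding."""
    df = df.copy()
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype(str)
    return df

mixed = pd.DataFrame({"permit": [True, "True", None],
                      "gps_height": [0, 1385, 686]})
clean = stringify_categoricals(mixed)
```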
I will need to pre-process the data in preparation for classification. Pre-processing is different for categorical and numerical variables. In order to implement different pre-processing flows, I must first classify all of the variables as categorical or numerical. The function below separates columns into these two classes and excludes any variables that will be dropped from the model.
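A sketch of that separation (function and variable names are my own):

```python
import pandas as pd

def split_columns(df, drop_cols=()):
    """Separate the remaining columns into categorical and numerical lists,
    skipping any columns that will be dropped from the model."""
    keep = [c for c in df.columns if c not in drop_cols]
    categorical = [c for c in keep if df[c].dtype == "object"]
    numerical = [c for c in keep if c not in categorical]
    return categorical, numerical

df = pd.DataFrame({"quantity": ["enough", "dry"],
                   "gps_height": [0, 1385],
                   "recorded_by": ["a", "b"]})
cat_cols, num_cols = split_columns(df, drop_cols=["recorded_by"])
```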
Below I build a preprocessing step for the pipeline which handles all data processing.
Categorical Preprocessing Pipeline
The pipeline below executes the following three steps for all of the categorical data.
1. Convert all values in categorical columns to strings. This avoids data type errors in the following steps.
2. Fill all missing values with a placeholder string.
3. One-hot encode all categorical variables. Because this data contains categorical variables with many possible values, it is possible to encounter values in the testing data that were not present in the training data. For this reason, I need to set `handle_unknown` to `ignore` so that the encoder will simply ignore unknown values in testing data.
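A sketch of the three steps (the `"missing"` placeholder string is my assumption; note the cast keeps NaN as NaN so the imputer can still find it):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.impute import SimpleImputer

categorical_pipeline = Pipeline([
    # 1. Cast non-missing values to str (keeping NaN for the imputer).
    ("to_string", FunctionTransformer(
        lambda df: df.where(df.isna(), df.astype(str)))),
    # 2. Replace missing values with a constant placeholder string.
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    # 3. One-hot encode; unseen categories at predict time encode to all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

demo = pd.DataFrame({"permit": [True, False, None, True]})
encoded = categorical_pipeline.fit_transform(demo)
```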
Numerical Preprocessing Pipeline
The pipeline below executes two steps: 1. Imputes missing values in any numerical column with the median value from that column. 2. Scales each variable to have mean zero and standard deviation one.
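A sketch of those two steps:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

numerical_pipeline = Pipeline([
    # 1. Fill missing values with each column's median.
    ("impute", SimpleImputer(strategy="median")),
    # 2. Standardize to mean zero, standard deviation one.
    ("scale", StandardScaler()),
])

demo = np.array([[0.0], [1385.0], [np.nan], [686.0]])
scaled = numerical_pipeline.fit_transform(demo)
```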
The column transformer below implements each of the three possible pre-processing behaviors.
1. Apply the categorical pipeline.
2. Apply the numerical pipeline.
3. Drop the specified columns.
The if-then statement below ensures that the drop processor is only implemented if there are columns to drop. This is needed since passing an empty `drop_col` list throws an error.
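A sketch of the guarded construction (the pipeline internals are abbreviated stand-ins for the two pipelines above, and the names are my own):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_pipe = Pipeline([("impute", SimpleImputer(strategy="constant",
                                              fill_value="missing")),
                     ("onehot", OneHotEncoder(handle_unknown="ignore"))])
num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scale", StandardScaler())])

def make_preprocessor(cat_cols, num_cols, drop_cols):
    transformers = [("categorical", cat_pipe, cat_cols),
                    ("numerical", num_pipe, num_cols)]
    # Only add the drop step when there is something to drop; an empty
    # column list here throws an error.
    if drop_cols:
        transformers.append(("drop", "drop", drop_cols))
    return ColumnTransformer(transformers)
```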
Build Model Pipeline
Below I build the main pipeline which executes two steps: 1. Apply preprocessing to the raw data. 2. Fit a one-vs-rest classifier to the processed data using an eXtreme Gradient Boosted forest model.
Building Parameter Grid
Below I define a grid of hyper-parameters for the pipeline that will be tested in a grid search below.
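The grid uses scikit-learn's double-underscore routing from pipeline step down to estimator parameter; the specific values below are illustrative placeholders, not the grid I actually searched.

```python
# <pipeline step>__estimator__<XGBoost hyper-parameter>; `estimator` is the
# OneVsRestClassifier attribute holding the wrapped XGBClassifier.
param_grid = {
    "classify__estimator__n_estimators": [100, 200],
    "classify__estimator__max_depth": [8, 12],
    "classify__estimator__learning_rate": [0.1, 0.3],
}
```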
Instantiate Grid Search
Below I instantiate a grid search object which will fit the pipeline for every combination of the parameters defined above. Since the competition uses accuracy as its measure of model quality, I will evaluate model performance in terms of accuracy. For each parameter combination, the grid search will also execute five-fold cross validation.
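The shape of the call, sketched with a small stand-in pipeline and grid so it runs on its own (in the notebook the pipeline and grid are the XGBoost versions described above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()),
                 ("classify", LogisticRegression())])
param_grid = {"classify__C": [0.1, 1.0]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="accuracy",  # the competition's metric
    cv=5,                # five-fold cross-validation per combination
    n_jobs=-1,
)

X = np.random.default_rng(0).normal(size=(40, 3))
y = np.tile([0, 1], 20)
search.fit(X, y)
```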
In order to maximize performance, I will fit the grid search on the full provided training data set and select the best hyper-parameters based on the results of cross validation. For the purposes of local model evaluation, I will then refit the best model on the local training data and use the local testing data to produce a confusion matrix.
Fit Grid Search
Below I fit the grid search on the full training set and select the best model hyper-parameters. This step takes an extremely long time to run.
Display Results of Grid Search
Below I display the results of the grid search. I pay particular attention to `std_test_score`, which will become larger if the model is over-fit.
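A self-contained illustration of pulling those columns out of `cv_results_` (toy data and a toy estimator, not the notebook's):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X = np.random.default_rng(1).normal(size=(60, 3))
y = np.tile([0, 1, 2], 20)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 8]},
                      scoring="accuracy", cv=5).fit(X, y)

# A std_test_score that is large relative to mean_test_score means the
# fold-to-fold scores are unstable, a warning sign of over-fitting.
results = pd.DataFrame(search.cv_results_)
summary = results[["param_max_depth", "mean_test_score", "std_test_score"]]
```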
Predict on Validation Data
Below I import the testing data provided by the competition. To maximize performance I refit the model on the full training data set. Predictions are formatted and saved to CSV for submission.
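The formatting step looks roughly like this; the `id` index and `status_group` column follow the competition's submission format, and `predictions` is a hard-coded stand-in for what the fitted model would return:

```python
import pandas as pd

# Stand-in validation features, indexed by the competition's id column.
X_val = pd.DataFrame({"gps_height": [0, 1385]},
                     index=pd.Index([101, 102], name="id"))
predictions = ["functional", "non functional"]  # model.predict(X_val) in the notebook

# Keep the id index so the file matches the required submission format.
submission = pd.DataFrame({"status_group": predictions}, index=X_val.index)
submission.to_csv("submission.csv")
```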
The two most promising directions for further work seem to be: 1. Integrating re-sampling into the pipeline to improve accuracy on the 'functional needs repair' class. 2. Implementing hierarchical models or stacked models.