
How to estimate feature importance and perform feature selection with XGBoost in Python

By Deepnote team

Updated on November 23, 2023

Techniques for feature importance and selection with XGBoost.

A benefit of using ensembles of decision trees built with gradient boosting, such as XGBoost, is that they automatically provide estimates of feature importance from a trained predictive model. This article will guide you through estimating feature importance with Python's XGBoost library, plotting these importances, and leveraging them for feature selection.

How to calculate feature importance with the gradient boosting algorithm

In gradient boosting, feature importance measures how valuable each feature was in constructing the boosted decision trees. Within a single tree, importance is calculated from the amount each split on a feature improves the performance measure, weighted by the number of observations the node is responsible for. These values are then averaged across all of the trees in the model, giving each feature a single score that allows for a comparative ranking.
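XGBoost actually supports several definitions of importance: 'weight' (how many times a feature is used to split), 'gain' (the average improvement the feature's splits bring to the objective), and 'cover' (the average number of observations affected by those splits). As a minimal, self-contained sketch, using synthetic data in place of a real dataset, you can compare them through the underlying Booster's get_score() method:

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# synthetic stand-in data: 8 numeric features, binary target
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
model = XGBClassifier()
model.fit(X, y)

# get_score() reports per-feature scores under the chosen definition
booster = model.get_booster()
for importance_type in ('weight', 'gain', 'cover'):
    print(importance_type, booster.get_score(importance_type=importance_type))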

How to manually plot feature importance in Python using XGBoost

A trained XGBoost model exposes these scores through its feature_importances_ attribute. To visualize them, you can plot a bar chart. Here's how you can do it:

from xgboost import XGBClassifier
from matplotlib import pyplot
import numpy as np

# Example: the Pima Indians Diabetes dataset (8 numeric features, binary target)
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=",")
X, y = dataset[:, 0:8], dataset[:, 8]

# fit the model on all of the data
model = XGBClassifier()
model.fit(X, y)

# print the raw importance scores, then plot them as a bar chart
print(model.feature_importances_)
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Running the example prints one importance score per feature:

[ 0.089701    0.17109634  0.08139535  0.04651163  0.10465116  0.2026578
  0.1627907   0.14119601]
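The bar chart above indexes features only by column position. If you know the dataset's column names, you can label the bars directly; here is a brief sketch using commonly used short names for the Pima columns (an assumption, since the CSV itself has no header row):

from matplotlib import pyplot

# assumed short names for the 8 Pima columns; the CSV has no header row
feature_names = ['preg', 'plas', 'pres', 'skin', 'insu', 'mass', 'pedi', 'age']

# model carries over from the example above
pyplot.bar(feature_names, model.feature_importances_)
pyplot.ylabel('Importance')
pyplot.show()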

Using XGBoost's built-in feature importance plot

XGBoost simplifies this process with a built-in plot_importance() function, which automatically orders features by importance:

from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]

# fit the model
model = XGBClassifier()
model.fit(X, y)

# plot feature importance
plot_importance(model)
pyplot.show()
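plot_importance() also accepts a few useful options. As a small sketch (reusing the model fitted above), importance_type switches the metric and max_num_features trims the chart:

from xgboost import plot_importance
from matplotlib import pyplot

# plot only the five features with the highest average gain
plot_importance(model, importance_type='gain', max_num_features=5)
pyplot.show()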

How to use XGBoost feature importance scores for feature selection

Scikit-learn's SelectFromModel class can use these importance scores for feature selection: it wraps a model trained on the full set of features and selects the subset whose importances meet a given threshold. By trying each feature's importance score as the threshold, you can evaluate every subset size. Here's an example:

from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X, y, np, and XGBClassifier carry over from the examples above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
model.fit(X_train, y_train)

# baseline: accuracy using all features
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")

# try each feature's importance score as a selection threshold
thresholds = np.sort(model.feature_importances_)
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Thresh={thresh:.3f}, n={select_X_train.shape[1]}, Accuracy: {accuracy*100:.2f}%")

Running the example prints the baseline accuracy on all features, followed by one line per threshold:

Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%
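In practice, you would not stop at printing every combination; you would pick one threshold and refit a final model on that subset. Here is a sketch of that selection step, re-running the loop above but collecting the results and keeping the smallest feature subset within one accuracy point of the best (the 0.01 tolerance is an arbitrary choice for illustration, not part of the original example):

# same loop as above, but collect (threshold, n_features, accuracy) tuples
results = []
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    acc = accuracy_score(y_test, selection_model.predict(selection.transform(X_test)))
    results.append((thresh, select_X_train.shape[1], acc))

# keep the fewest features whose accuracy is within 0.01 of the best
best_acc = max(acc for _, _, acc in results)
thresh, n, acc = min((r for r in results if r[2] >= best_acc - 0.01), key=lambda r: r[1])
print(f"Chosen: Thresh={thresh:.3f}, n={n}, Accuracy: {acc*100:.2f}%")

On the run shown above, this rule would keep the 6-feature subset, trading 0.39 points of accuracy (77.56% vs. 77.95%) for two fewer features.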

Summary

This tutorial covered how to calculate, plot, and use feature importance scores in XGBoost for Python. Starting from the concept of feature importance in gradient boosting, we demonstrated how to plot these importances and how to apply them to feature selection. The practical implementation shows how feature selection affects model performance, helping you balance model complexity against accuracy.
