from lightgbm import LGBMClassifier
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots
import random
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
import spacy
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")

Introduction.

A passenger airline company has requested an analysis of its user feedback data in a bid to better understand the pros and cons of flying with the company. This is a work in progress (January 2024), along with the Premier League project and a couple of others, and like those projects there are likely to be some changes and/or more analysis blocks added in the coming weeks. Danka.

The data.

Training data.

Thankfully there are only 300-ish missing values across 103,904 rows and they all exist in one column.
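A quick way to confirm that kind of claim is to count nulls per column. This is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions, not the airline data); the same two lines would apply to `df` once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "arrival_delay_in_minutes": [0.0, np.nan, 12.0, np.nan],
})

# Count nulls per column, then keep only the columns that have any.
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```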

If these projects were resume projects then these datasets aren't really doing my data cleaning skills any favours! To be honest, inventing strategies for data cleaning is my favourite part and I have big projects devoted purely to cleaning data, but still, seeing a dataset that doesn't require any cleaning is quite nice sometimes.

df = pd.read_csv("/work/train.csv")

Test data.

We have 25,976 rows here, which equates to 25% of the training data, so someone's done their homework. There are 83 missing values here in the Arrival Delay in Minutes column so I'll clean those up now.

test = pd.read_csv("/work/test.csv")

Data cleaning and imputation.

Snake_case column labels.

df.columns, test.columns = df.columns.str.replace(' ','_'), test.columns.str.replace(' ','_')
df.columns, test.columns = df.columns.str.lower(), test.columns.str.lower()

The means of the 'Arrival Delay in Minutes' column in each set aren't too far off one another. The null values aren't too bothersome, and they sit among data in positions that would make any imputation method a viable option. I'll use interpolation because I've found it returns the best results across the board in any project I've used it in. To keep things spicy, I will use slightly different interpolation methods for the train and test sets.

df['arrival_delay_in_minutes'].mean(), df['departure_delay_in_minutes'].mean()

test['arrival_delay_in_minutes'].mean(), test['departure_delay_in_minutes'].mean()

Polynomial for the training data, linear for the test data.

df['arrival_delay_in_minutes'] = df['arrival_delay_in_minutes'].interpolate(method='polynomial', order=2)
test['arrival_delay_in_minutes'] = test['arrival_delay_in_minutes'].interpolate(method='linear', limit_direction='both')
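As a small illustration of what `limit_direction` changes, here is a sketch on a made-up `delays` Series (the values are assumptions). It only uses the linear method, because the polynomial method needs SciPy installed. Whichever method is used, it is probably worth checking with `.isna().sum()` afterwards that no nulls are left at the edges of the column.

```python
import numpy as np
import pandas as pd

# Made-up delays with a null at the start, in the middle and at the end.
delays = pd.Series([np.nan, 10.0, np.nan, 30.0, np.nan])

# Default direction is forward: the leading null has nothing before it and stays null.
forward_only = delays.interpolate(method="linear")

# limit_direction="both" also fills the leading null from the nearest valid value.
both_ways = delays.interpolate(method="linear", limit_direction="both")

print(forward_only.isna().sum(), both_ways.isna().sum())
```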

Concatenating train & test sets for feature engineering.

all_data = pd.concat([df, test], keys=['train', 'test'])

Creating age group bins.

age_edges = [0, 17, 25, 35, 45, 55, 65, 90]  # explicit edges so each bin matches its label
age_labels = ['0-17', '18-25', '26-35', '36-45', '46-55', '56-65', '66-90']
all_data['age_group'] = pd.cut(all_data.age, bins=age_edges, labels=age_labels)

And flight distance bins.

distance_bins = pd.cut(all_data.flight_distance, bins=10)
distance_labels = ['0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000', '4000-4500', '4500-5000']
all_data['distance_group'] = pd.cut(all_data.flight_distance, bins=10, labels=distance_labels)
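One thing to keep in mind with `pd.cut`: passing an integer for `bins` splits the observed range into equal-width intervals, so hand-written labels only describe the bins approximately. Passing explicit edges ties each label to its interval. A minimal sketch on made-up ages (the `ages`, `edges` and `labels` names are illustrative assumptions):

```python
import pandas as pd

# Made-up ages spanning 7 to 85.
ages = pd.Series([7, 16, 24, 39, 52, 70, 85])

# bins=3 splits the observed range into three equal-width intervals.
equal_width = pd.cut(ages, bins=3)

# Explicit edges give intervals that match the hand-written labels exactly.
edges = [0, 17, 35, 90]
labels = ["0-17", "18-35", "36-90"]
explicit = pd.cut(ages, bins=edges, labels=labels)

print(equal_width.cat.categories)
print(explicit.tolist())
```

Printing the categories of the equal-width version is a cheap way to see how far the real edges sit from the labels.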

Removing column whitespace.

all_data.columns = all_data.columns.str.strip()

Splitting back to train and test.

df, test = all_data.loc['train'], all_data.loc['test']
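The concatenate, engineer, split pattern works because `keys` adds an outer index level that `.loc` can select on. A minimal sketch with made-up frames (`train_toy`, `test_toy` and the `age_plus_one` column are assumptions for illustration):

```python
import pandas as pd

train_toy = pd.DataFrame({"age": [25, 40]})
test_toy = pd.DataFrame({"age": [31]})

# keys= labels each block in an outer index level.
combined = pd.concat([train_toy, test_toy], keys=["train", "test"])

# Any feature engineered here is applied to both blocks at once.
combined["age_plus_one"] = combined["age"] + 1

# Selecting on the outer level recovers each block with its original row index.
train_back = combined.loc["train"]
test_back = combined.loc["test"]
print(train_back.shape, test_back.shape)
```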

Dropping the 'unnamed:_0' & 'id' columns from both sets and the 'satisfaction' column from the test set only, as that's the target column for the final model.

df = df.drop(['unnamed:_0', 'id'], axis=1)
test = test.drop(['unnamed:_0', 'id', 'satisfaction'], axis=1)

Dataframe head.

Test set head.

Training set statistical description.

• The age column has a mean of 39.37 but a low std. It's likely that kids & elderly will have flown with this airline so younger & older passengers will exist here, but with such a low deviation the majority of the passengers will be pretty close to this mean age.

• The inflight wifi service column is a satisfaction score ranging from 1 to 5, with 0 == n/a. With a mean of 2.7, it looks as though reviews of the wifi are close to neutral, leaning slightly negative.

• The departure/arrival_time_convenient column is another satisfaction score (actually most of them are, that's the project, so I'll go with the 'DRY' method from now on) and we see a mean of 3, which tells us that passengers are slightly more satisfied with the departure / arrival experience than they are with the wifi at least.

• From that column on, the mean values for each column besides 'ease_of_online_booking' are hovering around a value of 3, so satisfaction across the board is more neutral than positive.

Of the categorical columns, 'Female' is the top gender, 'Loyal' customers are the most frequent, 'Business' is the most frequent class & travel type, and 'neutral or dissatisfied' is the top hit for satisfaction.

Analysis.

Box plots for all numeric columns.

The majority of columns are satisfaction values which will rarely stray from the norm. I can see some odd values in the distance and delay columns, but I won't be treating them as outliers just yet. I might trim them off for the final model depending on whether the flight distance equates to a data employee mistakenly inputting a flight to the moon or whatever, but for now I'd say that they're more than likely going to be quite precise representations of actual distance & time values.

We see the volume in the boxplots mostly hovering around the 3 region, with seat comfort, in-flight entertainment, on-board service, in-flight service and legroom service all around 4 (there are a lot of business class features here). Baggage handling volume is also around 4, which is pretty good whether you're travelling as a business class passenger or otherwise.

There's no need for a correlation matrix for any of these data. Some mild correlations could be age + delay, with older passengers and passengers with kids experiencing less overall satisfaction due to longer wait times, but I don't need much more of an idea about the data than what can be found through EDA.

Overall satisfaction.

• 43.3% of the passengers were satisfied with the experience.

• 56.7% stated neutral or dissatisfied.
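Shares like these can be read straight off `value_counts(normalize=True)`. A minimal sketch on a made-up `satisfaction` Series (the five responses are assumptions, not the real split):

```python
import pandas as pd

# Made-up responses; the real column lives in the training frame.
satisfaction = pd.Series([
    "satisfied",
    "neutral or dissatisfied",
    "neutral or dissatisfied",
    "satisfied",
    "neutral or dissatisfied",
])

# normalize=True returns proportions, which are then scaled to percentages.
shares = satisfaction.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

With only two categories the two shares should always sum to 100, which makes for a handy sanity check on the figures quoted.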

Neutral / dissatisfied passengers.

The majority counts for neutral / dissatisfied responses are in the gender, customer type, type of travel, class and delay columns.

Taking a deeper look into those metrics it's easy to see in which areas the discrepancies lie. The least satisfied are:

• The Female gender (slight).

• The Disloyal customers (major).

• The personal travellers (major).

• Eco Plus and Eco class (major).

• Almost anybody experiencing a delay time.

Gender.

The data is relatively evenly split between the two genders, with a slight majority of female passengers.

The average passenger age is 0.27 years younger for the Females.

Customer type by gender sum and average age.

Males make up the majority of Loyal customers (+1.5%) and Females make up the majority of Disloyal customers (+18%).

• Of the Loyal customer type, there is a slight majority of both Male and Female passengers responding with 'neutral or dissatisfied'.

• Of the Disloyal customer type, there is a large majority of both Male and Female passengers responding with 'neutral or dissatisfied'.

The averages below indicate that a higher percentage of Loyal customers, regardless of gender, reported being satisfied with the airline service compared to Disloyal customers. Among Disloyal customers, Males have a satisfaction rate around 2% higher than Females.

Travel type by gender.

Females make up the majority of Business travel passengers (+5%) and Males make up the majority of Personal travel passengers (+4%).

• Of the Business travel type, there is a slight majority of both Male and Female passengers responding with 'satisfied', with more Males than Females responding as such.

• Of the Personal travel type, there is a vast majority of both Male and Female passengers responding with 'neutral or dissatisfied'.

The averages for the above data show that 59.4% of Male passengers and 57.1% of Female passengers in the Business travel category were satisfied, whereas only 10.1% of Male passengers and 10.2% of Female passengers in the Personal travel category were satisfied.
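Satisfaction rates per group like the ones quoted above can be computed by turning the target into a boolean and averaging it within each group. A minimal sketch on a made-up frame (the rows in `toy` and the `is_satisfied` helper column are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "type_of_travel": ["Business travel", "Personal Travel", "Business travel",
                       "Personal Travel", "Business travel", "Business travel"],
    "satisfaction": ["satisfied", "neutral or dissatisfied", "satisfied",
                     "neutral or dissatisfied", "neutral or dissatisfied", "satisfied"],
})

# The mean of a boolean column is the share of True values in each group.
toy["is_satisfied"] = toy["satisfaction"].eq("satisfied")
rates = (
    toy.groupby(["type_of_travel", "gender"])["is_satisfied"]
    .mean()
    .mul(100)
    .round(1)
)
print(rates)
```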

Satisfaction levels per airline service.

The highest satisfaction scores per service by flight class are as follows:

Business Class:
• In-flight wifi: 2.94.
• Ease of online booking: 2.91.
• Gate location: 2.98.
• Food and drink: 3.32.
• Online boarding: 3.72.
• Seat comfort: 3.76.
• Inflight entertainment: 3.63.
• On-board service: 3.68.
• Leg room service: 3.64.
• Baggage handling: 3.84.
• Check-in service: 3.52.
• Inflight service: 3.84.
• Cleanliness: 3.48.

Eco Class:
• In-flight wifi: 3.86.
• Ease of online booking: 2.61.
• Gate location: 2.97.
• Food and drink: 3.09.
• Online boarding: 2.81.
• Seat comfort: 3.14.
• Inflight entertainment: 3.10.
• On-board service: 3.12.
• Leg room service: 3.09.
• Baggage handling: 3.45.
• Check-in service: 3.12.
• Inflight service: 3.46.
• Cleanliness: 3.11.

Eco Plus Class:
• In-flight wifi: 3.88.
• Ease of online booking: 2.66.
• Gate location: 2.97.
• Food and drink: 3.12.
• Online boarding: 2.89.
• Seat comfort: 3.18.
• Inflight entertainment: 3.14.
• On-board service: 3.05.
• Leg room service: 3.06.
• Baggage handling: 3.36.
• Check-in service: 3.02.
• Inflight service: 3.39.
• Cleanliness: 3.13.

Average flight distance per gender.

The Male and Female passengers have equal volume at the min and lower fence. The Male passengers have a larger Q1, median and upper fence volume than the Female passengers, but the Female passengers appear to hold quite a bit more data in the 5000 region.

Average flight distance per gender:

Analysing that information further we can see that most distance groups show a majority Female distribution, but for longer flights (4000-4500 miles and 4500-5000 miles), the percentages shift a little, with Males making up a higher percentage in the 4000-4500 miles group, and Females making up a higher percentage in the 4500-5000 miles group.

So Females make up +1.4% of the gender data but Males fly 1% further in total, with Females purchasing the longest distance flights to the tune of +15%.

• 0-500 miles: 48.87% Males, 51.13% Females.
• 500-1000 miles: 49.28% Males, 50.72% Females.
• 1000-1500 miles: 49% Males, 51% Females.
• 1500-2000 miles: 50.18% Males, 49.82% Females.
• 2000-2500 miles: 48.95% Males, 51.05% Females.
• 2500-3000 miles: 49.83% Males, 50.17% Females.
• 3000-3500 miles: 51.61% Males, 48.39% Females.
• 3500-4000 miles: 48.21% Males, 51.79% Females.
• 4000-4500 miles: 53.57% Males, 46.43% Females.
• 4500-5000 miles: 42.5% Males, 57.5% Females.
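A per-group gender split like the list above is what `pd.crosstab` with `normalize="index"` produces, since each row is scaled to sum to 1. A minimal sketch on made-up rows (the `toy` frame is an assumption):

```python
import pandas as pd

toy = pd.DataFrame({
    "distance_group": ["0-500", "0-500", "0-500", "0-500", "500-1000", "500-1000"],
    "gender": ["Male", "Female", "Female", "Female", "Male", "Female"],
})

# normalize="index" turns counts into row-wise proportions per distance group.
share = (
    pd.crosstab(toy["distance_group"], toy["gender"], normalize="index")
    .mul(100)
    .round(2)
)
print(share)
```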

Satisfaction by gender per distance group.

The genders experiencing the best satisfaction levels per distance group are:

• 0-500: Females.
• 500-1000: Males.
• 1000-1500: Males.
• 1500-2000: Males.
• 2000-2500: Females.
• 2500-3000: Males.
• 3000-3500: Males.
• 3500-4000: Females.
• 4000-4500: Males.
• 4500-5000: Females.

The satisfaction levels begin to fluctuate noticeably during the longer flights, seemingly those above 3000 miles. There is quite a large difference between the average satisfaction score per gender in the 4500-5000 group, so let's break that down a bit and try to figure out why the Females are generally more satisfied than the Males on the longest journeys.

The averages of each column in the 'df_ge_4500' dataframe, representing flights with a distance greater than or equal to 4500 miles, show that the Female passengers are a shade over a year older than the Males on average, and are less satisfied than the Males with only the wifi service, the on-board service and the online booking service. Every other facet of the flight experience, seat comfort included, was less satisfactory for the Males on the furthest flights.

Average satisfaction for the Female gender, flight distances >= 4500:
• Average age: 41.58.
• Average inflight wifi service satisfaction score: 2.63.
• Average departure/arrival time convenience score: 3.03.
• Average ease of online booking score: 2.69.
• Average gate location score: 3.02.
• Average food and drink satisfaction score: 3.19.
• Average online boarding satisfaction score: 3.63.
• Average seat comfort satisfaction score: 3.69.
• Average inflight entertainment satisfaction score: 3.71.
• Average on-board service score: 3.46.
• Average legroom service: 4.03.
• Average baggage handling service: 3.91.
• Average check-in service: 3.58.
• Average in-flight service: 3.62.
• Average cleanliness: 3.48.

Average satisfaction for the Male gender, flight distances >= 4500:
• Average age: 40.24.
• Average inflight wifi service satisfaction score: 2.85.
• Average departure/arrival time convenience score: 2.79.
• Average ease of online booking score: 3.19.
• Average gate location score: 2.65.
• Average food and drink satisfaction score: 3.08.
• Average online boarding satisfaction score: 3.34.
• Average seat comfort satisfaction score: 3.48.
• Average inflight entertainment satisfaction score: 3.45.
• Average on-board service score: 3.56.
• Average legroom service: 3.56.
• Average baggage handling service: 3.70.
• Average check-in service: 3.46.
• Average in-flight service: 3.53.
• Average cleanliness: 3.43.

Modeling.

df['target_label'] = df['satisfaction'].apply(lambda x: 1 if x == 'satisfied' else 0)

y = df['target_label']

df = df.drop(['age', 'flight_distance', 'target_label', 'satisfaction'], axis=1)
test = test.drop(['age', 'flight_distance'], axis=1)

X = df  # after the drops above, df holds only the feature columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
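The selectors that follow expect numeric inputs. No encoding step appears in this notebook, so if the categorical columns (gender, class, the age and distance groups and so on) still hold strings at this point, something like one-hot encoding would be needed before fitting. This is a sketch of one option under that assumption, on made-up frames (`train_toy` and `unseen_toy` are hypothetical); it is not necessarily what was run originally.

```python
import pandas as pd

train_toy = pd.DataFrame({"class": ["Business", "Eco", "Eco Plus"], "cleanliness": [4, 3, 5]})
unseen_toy = pd.DataFrame({"class": ["Eco", "Business"], "cleanliness": [2, 4]})

# One-hot encode the string column on the training side.
train_encoded = pd.get_dummies(train_toy, columns=["class"])

# Encode the unseen side the same way, then align it to the training columns
# so that categories missing from the unseen data become all-zero columns.
unseen_encoded = (
    pd.get_dummies(unseen_toy, columns=["class"])
    .reindex(columns=train_encoded.columns, fill_value=0)
)
print(list(unseen_encoded.columns))
```

Aligning with `reindex` matters because the final predictions select columns from the unseen set by the names chosen on the training set.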

I will be performing model-specific feature selection with 3-4 select models (hardware reliant) before a voting classifier pipeline for model selection.

RandomForestClassifier selector.

rfc = RandomForestClassifier(random_state=42)
rfc_selector = SelectFromModel(estimator=rfc).fit(X_train, y_train)
X_train_rfc = rfc_selector.transform(X_train)
X_test_rfc = rfc_selector.transform(X_test)

selected_features_rfc = rfc_selector.get_support()

LightGBM Classifier selector.

lgb = LGBMClassifier(objective='binary', random_state=42)
lgb_selector = SelectFromModel(estimator=lgb).fit(X_train, y_train)
X_train_lgb = lgb_selector.transform(X_train)
X_test_lgb = lgb_selector.transform(X_test)

selected_features_lgb = lgb_selector.get_support()

XGBoost selector.

xgb = XGBClassifier(random_state=42)
xgb_selector = SelectFromModel(estimator=xgb).fit(X_train, y_train)
X_train_xgb = xgb_selector.transform(X_train)
X_test_xgb = xgb_selector.transform(X_test)

selected_features_xgb = xgb_selector.get_support()

selected_columns_rfc = X_train.columns[selected_features_rfc]
selected_columns_lgb = X_train.columns[selected_features_lgb]
selected_columns_xgb = X_train.columns[selected_features_xgb]
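For reference, `get_support()` returns a boolean mask over the input columns, and with tree-based estimators `SelectFromModel` keeps the features whose importance is at or above the mean importance by default. A minimal sketch on synthetic data (the `X_toy`, `y_toy` names and the `make_classification` settings are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: six features, only two of them informative.
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_toy, y_toy)

# Boolean mask over the columns, then the names of the kept columns.
mask = selector.get_support()
kept = X_toy.columns[mask]
print(list(kept))
```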

LightGBM had what I would call the 'best' output if I'm going to lean on the EDA's insights, so I will use the features selected by that model and store them as 'X_train_selected'.

X_train_selected = X_train[selected_columns_lgb]
X_test_selected = X_test[selected_columns_lgb]

Model pipeline.

There will be no need for hyperparameter tuning. From what I've seen in the EDA there should be no issue with default params on most models, and I'm using Deepnote's hardware so admittedly I feel a bit sorry for them at times like this.

rfc = RandomForestClassifier(random_state=42)
svc = SVC(random_state=42)
xgb = XGBClassifier(random_state=42)
lgb = LGBMClassifier(objective='binary', random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('rfc', rfc),
        ('svc', svc),
        ('xgb', xgb),
        ('lgb', lgb),
    ],
    voting='hard'
)
voting_clf.fit(X_train_selected, y_train)
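Hard voting simply takes the most common predicted label per sample. With four estimators a 2-2 tie is possible, and scikit-learn resolves ties in favour of the label that sorts first (here 0). A rough NumPy sketch of that rule on made-up predictions (the `predictions` array is an assumption, not model output):

```python
import numpy as np

# Rows are models, columns are samples; values are made-up class predictions.
predictions = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
])

# Count the votes per class for each sample; argmax picks the lowest label on a tie.
majority = np.apply_along_axis(
    lambda votes: np.bincount(votes, minlength=2).argmax(), 0, predictions
)
print(majority)
```

An odd number of estimators would avoid ties altogether for a binary target.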

rfc_clf = voting_clf.estimators_[0]
xgb_clf = voting_clf.estimators_[2]
lgb_clf = voting_clf.estimators_[3]
rfc_importances = rfc_clf.feature_importances_
xgb_importances = xgb_clf.feature_importances_
lgb_importances = lgb_clf.feature_importances_
average_importance = np.mean([rfc_importances, xgb_importances, lgb_importances], axis=0)
df_importance = pd.DataFrame({'feature': X_train_selected.columns, 'importance': average_importance})
df_importance_sorted = df_importance.sort_values('importance', ascending=False)
df_importance_sorted
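One caveat when averaging importances across libraries: the random forest and XGBoost importances are normalised to sum to 1, while LightGBM's default `feature_importances_` are raw split counts, so a plain mean can end up dominated by LightGBM. A possible adjustment, sketched here on made-up numbers (the three arrays and the `to_share` helper are hypothetical), is to rescale each vector to shares before averaging:

```python
import numpy as np

# Made-up importances for three features.
rfc_imp = np.array([0.5, 0.3, 0.2])   # already sums to 1
xgb_imp = np.array([0.6, 0.1, 0.3])   # already sums to 1
lgb_imp = np.array([100, 700, 200])   # raw split counts

# Plain mean: the split counts swamp the other two models.
raw_average = np.mean([rfc_imp, xgb_imp, lgb_imp], axis=0)

def to_share(values):
    """Rescale a vector of importances so that it sums to 1."""
    values = np.asarray(values, dtype=float)
    return values / values.sum()

# Mean of shares: each model contributes on the same scale.
scaled_average = np.mean(
    [to_share(rfc_imp), to_share(xgb_imp), to_share(lgb_imp)], axis=0
)
print(raw_average.argmax(), scaled_average.argmax())
```

In this toy case the top-ranked feature changes once the scales are aligned, which is the effect to watch for in the real ranking.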

And predictions on the test set using the voting classifier, with the features selected by LightGBM.

X_unseen = test[selected_columns_lgb]
new_predictions = voting_clf.predict(X_unseen)
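The 20% holdout created by `train_test_split` is never scored in the cells above, even though `accuracy_score` is imported. A minimal sketch of that check on made-up labels (`y_true_toy` and `y_pred_toy` are assumptions); in the notebook the equivalent inputs would presumably be `y_test` and `voting_clf.predict(X_test_selected)`.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for four passengers.
y_true_toy = [1, 0, 1, 1]
y_pred_toy = [1, 0, 0, 1]

# Fraction of predictions that match the true label.
holdout_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(holdout_accuracy)
```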