Titanic

This notebook uses the Titanic dataset from Kaggle. I'm building it after completing chapter 2 of Hands-On Machine Learning, which covers how to complete an end-to-end machine learning project.

Titanic tutorial from Alexis Cook

Frame the problem

Objective: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

It seems some groups of people were more likely to survive than others.

Answer the question: "what sorts of people were more likely to survive?" using passenger data (e.g. name, age, gender, socio-economic class, etc.).

Get the data

Datasets

1) train.csv: subset of passengers (n=891) for training.

The value in the Survived column indicates whether the passenger survived or not (survived=1, died=0).

2) test.csv: subset of passengers for testing (n=418).

3) gender_submission.csv: example to show how you should structure your predictions. It predicts that all female passengers survived, and all male passengers died.

Your submission file should contain a PassengerId column (containing the ID of each passenger from test.csv) and a Survived column, where survived=1 and died=0.
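As a quick illustration of that layout, here is a minimal sketch that builds a two-column frame and writes it as CSV text. The IDs and predictions are made up for the example, and `example_submission` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical passenger IDs and predictions, for illustration only
passenger_ids = [892, 893, 894]
predicted = [0, 1, 0]

example_submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predicted,
})

# index=False keeps the pandas row index out of the file,
# so the CSV has only the two columns described above
csv_text = example_submission.to_csv(index=False)
print(csv_text)
```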

# Create dataframes from CSV files
import pandas as pd
import numpy as np

train = pd.read_csv('datasets/train.csv')
test = pd.read_csv('datasets/test.csv')
gender_submission = pd.read_csv('datasets/gender_submission.csv')

train.head()

test.head()

gender_submission.head()

Examine the data

train.shape

train.info()

# Examine categorical variables
train['Sex'].value_counts()

train['Ticket'].value_counts() # Not sure if this is valuable for the model

train['Cabin'].value_counts() # Not sure if this is valuable for the model

train['Embarked'].value_counts() # Might be useful, probably needs one-hot encoding

# Examine numerical variables
train.describe()

# See numerical variable distributions
train.hist(bins=50, figsize=(20, 15))

Observations on the data:

Many numerical variables are really more categorical, so we should use One-Hot or Ordinal Encoding to transform them before training our model (Ordinal: Pclass, SibSp, Parch).

Fare and Age have very different scales; we can fix this using feature scaling.

Fare and Age are also fairly tail-heavy. Can we transform them to have more bell-shaped distributions?
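One common option for a tail-heavy, non-negative attribute such as Fare is a log transform. The sketch below uses made-up fare values (not read from train.csv) and `np.log1p`, which is one possible choice rather than the transformation this notebook ends up using.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed fares for illustration (not taken from train.csv)
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.55, 71.28, 263.0, 512.33])

# log1p computes log(1 + x), so a fare of 0 maps to 0 instead of -inf
log_fares = np.log1p(fares)

# Skewness closer to 0 indicates a more symmetric distribution
print("Skew before:", fares.skew())
print("Skew after: ", log_fares.skew())
```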

PassengerId should not be used for training; it is just a row identifier.

We do not need to create a test set because one has already been created for us: test.csv

Normally, if I were to create a test set, I'd use stratified sampling on a single variable so that the variable has the same distribution in both the train and test datasets (note: stratifying on multiple variables would be overly complex, per StackExchange and Scribbr).
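For reference, a stratified split on one variable can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The toy frame and the choice of Sex as the stratification variable are assumptions made only for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced 'Sex' column (values are made up)
toy = pd.DataFrame({
    "Sex": ["male"] * 60 + ["female"] * 40,
    "Fare": range(100),
})

# stratify= keeps the male/female proportions the same in both splits
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["Sex"], random_state=42
)

print(toy_train["Sex"].value_counts(normalize=True))
print(toy_test["Sex"].value_counts(normalize=True))
```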

Explore the data to gain insights

Potential ways to conduct exploratory data analysis:

Create boxplot/violinplot/swarmplot of different numerical attributes for different classes of the target variable. This helps identify potential relationships and dependencies that could impact the target.

# Start by setting aside the test set to make sure we're only working w/ training data
titanic = train.copy()

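# Looking for correlations
# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to correlate
corr_matrix = titanic.corr(numeric_only=True)
corr_matrix['Survived'].sort_values(ascending=False)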

Looks like Fare and Pclass are the attributes most positively and most negatively correlated with Survived, respectively.

These variables are actually quite similar. Fare is the price of the ticket, and Pclass is the class of the ticket (lower number --> higher class). Does this mean the richest passengers were more likely to survive?

# Visualize the correlation between attributes
from pandas.plotting import scatter_matrix

attributes = ['Survived', 'Fare', 'Parch', 'SibSp', 'Age', 'Pclass']
scatter_matrix(titanic[attributes], figsize=(12, 8))

This visualization really isn't so useful because the target attribute is binary, which makes it quite difficult to tease out any real pattern.

# Barplot to visualize distribution of categorical attribute for different classes of target
import seaborn as sns
import matplotlib.pyplot as plt

def cat_countplot(attribute, tight=False):
    sns.countplot(data=titanic, x=attribute, hue='Survived')
    if tight:
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()

cat_countplot('Sex')

cat_countplot('Pclass')

cat_countplot('Parch')

cat_countplot('SibSp')

It looks like having either one parent/child on the boat, or one sibling/spouse on the boat makes it more likely for the passenger to survive. Maybe that's because smaller families were easier to fit onto the lifeboats. And if an entire family couldn't fit on the lifeboat, then they didn't want to leave anyone behind (so they all stayed on the boat).
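Count plots show raw counts, so small groups can be hard to read. A complementary check is the survival rate per group, sketched here on a tiny made-up sample; in the notebook the same `groupby` would be applied to the `titanic` dataframe.

```python
import pandas as pd

# Tiny made-up sample; the real notebook would use the `titanic` dataframe
sample = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 1, 4, 4],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group;
# the count shows how many passengers that share is based on
rates = sample.groupby("SibSp")["Survived"].agg(["mean", "count"])
print(rates)
```

Looking at the count next to the rate helps avoid over-interpreting groups with only a handful of passengers.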

# Boxplot to visualize numerical variables
def num_boxplot(attribute):
    sns.boxplot(data=titanic, x='Survived', y=attribute)

num_boxplot('Fare')

num_boxplot('Age')

Attribute combinations

Family = Parch + SibSp

# Create new attribute
titanic['Family'] = titanic['Parch'] + titanic['SibSp']
titanic[(titanic['Parch']>0) & (titanic['SibSp']>0)].head()

# Updated correlation matrix (numeric columns only)
corr_matrix = titanic.corr(numeric_only=True)
corr_matrix['Survived'].sort_values(ascending=False)

cat_countplot('Family')

Although the correlation between Survived and Family is not as high as I thought it would be, it still seems like passengers whose families weren't too big (too big meaning more than four parents/children/siblings/spouses) were more likely to survive. Is there a strong relationship between Family and other attributes that have high correlation with Survived, like Fare or Pclass?

corr_matrix['Family'].sort_values(ascending=False)

There is a slight positive correlation between Family and Fare, meaning passengers with larger families tended to pay higher fares. And passengers with higher fares were more likely to survive.

Is it possible to extract titles from passenger names? And compare survival rates between different title classes?

# Raw string so the regex escape \. is not treated as a Python string escape
titanic['Title'] = titanic['Name'].str.extract(r', ([A-Za-z]+)\.', expand=False)
titanic['Title'].value_counts()

cat_countplot('Title', tight=True)

We know that women were more likely to survive than men. However, it's interesting to look at the Title value of 'Master', which appears to be the one category of male passengers who were more likely to survive than not. More likely than not they were wealthier passengers with higher fares/classes, although 'Master' was conventionally a title for young boys, so age could also explain it.
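One way to check what a title such as 'Master' actually represents is to look at the ages behind each title. The sketch below runs the same extraction pattern on made-up names and ages; the `people` frame is hypothetical and only mimics the layout of the Name column.

```python
import pandas as pd

# Made-up names in the same "Surname, Title. Given names" layout as the Name column
people = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Mrs. Jane",
        "Smith, Master. Tom",
        "Jones, Miss. Ann",
        "Jones, Master. Sam",
    ],
    "Age": [40.0, 38.0, 4.0, 19.0, 9.0],
})

# Raw string so that \. reaches the regex engine unchanged
people["Title"] = people["Name"].str.extract(r", ([A-Za-z]+)\.", expand=False)

# Median age per title: a quick check on who each title refers to
median_age = people.groupby("Title")["Age"].median()
print(median_age)
```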

Prepare the data

# Separate targets from training data
titanic = train.drop('Survived', axis=1)
titanic_labels = train['Survived'].copy()
titanic.columns

# Let's separate numerical and categorical attributes first
num_columns = titanic.select_dtypes(include=['number']).columns.tolist()
cat_columns = titanic.select_dtypes(exclude=['number']).columns.tolist()
titanic_num = titanic[num_columns]
titanic_cat = titanic[cat_columns]

print(f"""
Num columns: {num_columns}
Cat columns: {cat_columns}""")

Data cleaning

# Find attributes w/ missing data
titanic.info()

Attributes w/ missing data:

Age (714 non-null) --> can impute these missing values; age is likely an important attribute to include.

Cabin (204 non-null) --> probably best to drop this column entirely because too much of it is missing.

Embarked (889 non-null) --> can impute these missing values; there are only 2 missing.
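The non-null counts from `info()` can also be turned into an explicit missing-value table. This is a minimal sketch on a small made-up frame standing in for the training data; `df` and `missing` are names chosen for the example.

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps, standing in for the training data
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

# Count and share of missing values per column, worst first
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)
print(missing)
```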

# Impute missing values for numerical data
from sklearn.impute import SimpleImputer

# Mean and median for Age column are very close, shouldn't make a major difference
imputer = SimpleImputer(strategy='median')
imputer.fit(titanic_num)

# To see the medians calculated for each column
imputer.statistics_

# Create transformed dataframe
X = imputer.transform(titanic_num)
titanic_num_tr = pd.DataFrame(X, columns=titanic_num.columns, index=titanic_num.index)

# Impute missing values for categorical data
# Since there are only two missing values for Embarked, let's start w/ mode imputation
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(titanic_cat)
X = imputer.transform(titanic_cat)
titanic_cat_tr = pd.DataFrame(X, columns=titanic_cat.columns, index=titanic_cat.index)

# Encoding for categorical attributes
# Can more easily accomplish this with pd.get_dummies() instead of OneHotEncoder
dummy_variables = pd.get_dummies(titanic_cat_tr[['Sex', 'Embarked']])
titanic_cat_tr = pd.concat([titanic_cat_tr, dummy_variables], axis=1)
titanic_cat_tr.head()

Create custom transformers

Transformations I want to accomplish:

Add Family column = SibSp + Parch

Add Title column --> extracted from Name

Remove columns: PassengerId, Name, Ticket, Cabin

# Create custom transformer
from sklearn.base import BaseEstimator, TransformerMixin
import re

# Column indexes for variables
def get_col_index(col):
    index = train.columns.get_loc(col)
    return index

cols = ['PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch']
indices = [get_col_index(col) for col in cols]
passengerid_ix, survived_ix, name_ix, ticket_ix, cabin_ix, sibsp_ix, parch_ix = indices

class AttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_family=True, add_title=True):
        self.add_family = add_family
        self.add_title = add_title

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        family = X[:, sibsp_ix] + X[:, parch_ix]
        # Raw strings keep the regex escape \. intact
        title_extract = lambda name: re.search(r', ([A-Za-z]+)\.', name).group(1) if re.search(r', ([A-Za-z]+)\.', name) else None
        get_title = np.vectorize(title_extract)
        title = get_title(X[:, name_ix].astype(str))
        if self.add_family:
            X = np.c_[X, family]
        if self.add_title:
            X = np.c_[X, title]
        # New columns are appended at the end, so the original indices are still valid here
        remove_col_indices = passengerid_ix, name_ix, ticket_ix, cabin_ix
        X = np.delete(X, remove_col_indices, axis=1)
        return X

attr_adder = AttributesAdder(add_family=True, add_title=True)
attr_adder.transform(train.values)

Transformation pipelines

We need two separate pipelines, one for numerical columns and one for categorical columns.

print(f"""
All columns: {titanic.columns}
Numerical columns: {num_columns}
Categorical columns: {cat_columns}
""")

# Let's rewrite a separate custom transformer just for numerical attributes
passengerid_ix, pclass_ix, age_ix, sibsp_ix, parch_ix, fare_ix = 0, 1, 2, 3, 4, 5

class NumTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, add_family=True):
        self.add_family = add_family

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.add_family:
            family = X[:, sibsp_ix] + X[:, parch_ix]
            X = np.c_[X, family]
        X = np.delete(X, passengerid_ix, axis=1)
        return X

# And another one for categorical attributes
name_ix, sex_ix, ticket_ix, cabin_ix, embarked_ix = 0, 1, 2, 3, 4

class TitleAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_title=True):
        self.add_title = add_title

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.add_title:
            # Raw strings keep the regex escape \. intact
            title_extract = lambda name: re.search(r', ([A-Za-z]+)\.', name).group(1) if re.search(r', ([A-Za-z]+)\.', name) else None
            get_title = np.vectorize(title_extract)
            title = get_title(X[:, name_ix].astype(str))
            X = np.c_[X, title]
        remove_col_indices = name_ix, ticket_ix, cabin_ix
        X = np.delete(X, remove_col_indices, axis=1)
        return X

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Main steps: impute missing values, column transformations, feature scaling

# Numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('num_transformer', NumTransformer()),
    ('std_scaler', StandardScaler())
])

# Categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('title_adder', TitleAdder()),
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the numerical and categorical pipelines
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_columns),
    ('cat', cat_pipeline, cat_columns)
])

titanic_prepared = full_pipeline.fit_transform(titanic)

# Get categories from One Hot Encoder
cat_encoder = full_pipeline.named_transformers_['cat']
cat_encoder_feature_names = cat_encoder.named_steps['one_hot_encoder'].get_feature_names_out(['Sex', 'Embarked', 'Title'])
cat_one_hot_attribs = list(cat_encoder_feature_names)
cat_one_hot_attribs

Final columns after full pipeline transformation:

Numerical: Pclass, Age, SibSp, Parch, Fare, Family (extra)

Categorical: Sex, Embarked, Title (extra)

One-Hot encoded: (above)

final_attribs = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Family'] + cat_one_hot_attribs
final_attribs

Explore many different models

Different models to try: Logistic Regression, Decision Tree, Random Forest

# Utility function for model evaluation
def display_scores(scores):
    print('Scores:', scores)
    print('Mean:', scores.mean())
    print('Standard Deviation:', scores.std())

# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import os
import joblib

# Directory to store models
os.makedirs('models', exist_ok=True)

# logreg = LogisticRegression()
# logreg.fit(titanic_prepared, titanic_labels)
# joblib.dump(logreg, 'models/logreg.joblib')
logreg = joblib.load('models/logreg.joblib')

titanic_predictions = logreg.predict(titanic_prepared)
logreg_accuracy = accuracy_score(titanic_labels, titanic_predictions)
logreg_accuracy

# Cross-validation on Logistic Regression
# Using accuracy as evaluation metric for now
logreg_scores = cross_val_score(
    logreg,
    titanic_prepared,
    titanic_labels,
    scoring='accuracy',
    cv=10
)
display_scores(logreg_scores)

Logistic regression:

Training set accuracy score: 83.1%

Mean cross-validation accuracy: 82.6%

Cross-validation accuracy is within about half a percentage point of training accuracy, so the model shows little sign of overfitting.

# Decision Tree
from sklearn.tree import DecisionTreeClassifier

# tree = DecisionTreeClassifier()
# tree.fit(titanic_prepared, titanic_labels)
# joblib.dump(tree, 'models/tree.joblib')
tree = joblib.load('models/tree.joblib')

titanic_predictions = tree.predict(titanic_prepared)
accuracy_score(titanic_labels, titanic_predictions)

# Cross-validation on Decision Tree
tree_scores = cross_val_score(
    tree,
    titanic_prepared,
    titanic_labels,
    scoring='accuracy',
    cv=10
)
display_scores(tree_scores)

Decision tree classifier:

Training set accuracy score: 98.2%

Mean cross-validation accuracy: 77.7%

We're badly overfitting using a decision tree.
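A common way to rein in an overfitting tree is to limit its depth. The sketch below is not run on the Titanic features; it uses synthetic data from `make_classification` and an arbitrary `max_depth=3` purely to show how the gap between training and cross-validation accuracy can be compared.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y); not the Titanic features
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=42
)

results = {}
for depth in (None, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Accuracy on the data the tree was fit on
    train_acc = clf.fit(X, y).score(X, y)
    # Mean accuracy over 10 held-out folds
    cv_acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

An unrestricted tree can memorize the training set, so the interesting number is how far its cross-validation score falls below its training score compared with the shallower tree.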

# Random Forest
from sklearn.ensemble import RandomForestClassifier

# forest = RandomForestClassifier()
# forest.fit(titanic_prepared, titanic_labels)
# joblib.dump(forest, 'models/forest.joblib')
forest = joblib.load('models/forest.joblib')

titanic_predictions = forest.predict(titanic_prepared)
accuracy_score(titanic_labels, titanic_predictions)

# Cross validation
forest_scores = cross_val_score(
    forest,
    titanic_prepared,
    titanic_labels,
    scoring='accuracy',
    cv=10
)
display_scores(forest_scores)

Random forest classifier:

Training set accuracy score: 98.2%

Mean cross-validation accuracy: 80.9%

Random forest is still overfitting, but not as badly as the decision tree, and it gets fairly close to the logistic regression mean cross-validation score of 82.6%.

Fine-tune your models

Let's work with our Random Forest model going forward. It came close to logistic regression in cross-validation and has more hyperparameters to tune, so it looks the most promising. Let's use Randomized Search to perform hyperparameter tuning.

from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators': np.arange(100, 1000, 100),
    'max_depth': [None, 5, 10, 20],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

forest = RandomForestClassifier()
rnd_search = RandomizedSearchCV(
    forest,
    param_distributions=param_grid,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    verbose=2,
    random_state=42
)

# rnd_search.fit(titanic_prepared, titanic_labels)
# joblib.dump(rnd_search, 'models/forest_rnd_search.joblib')
rnd_search = joblib.load('models/forest_rnd_search.joblib')

# Get the best parameters and best score
best_params = rnd_search.best_params_
best_score = rnd_search.best_score_

# Print the best parameters and best score
print("Best Parameters: ", best_params)
print("Best Score: ", best_score)

# Feature importances
feature_importances = rnd_search.best_estimator_.feature_importances_
sorted(zip(feature_importances, final_attribs), reverse=True)

# Creating new transformer in preparation pipeline to only select the most important attributes
def indices_of_top_k(arr, k):
    return np.argsort(arr)[-k:]

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]

k = 5
top_k_feature_indices = indices_of_top_k(feature_importances, k)
top_k_feature_indices

np.array(final_attribs)[top_k_feature_indices]

# Double check that these are indeed the top k features
sorted(zip(feature_importances, final_attribs), reverse=True)[:k]

# Create a new pipeline that runs previously defined prep pipeline and adds top k feature selection
# Also, build in prediction
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('forest_classifier', RandomForestClassifier(**rnd_search.best_params_))
])
prepare_select_and_predict_pipeline.fit(titanic, titanic_labels)

# Try the full pipeline
some_data = titanic.iloc[:4]
some_labels = titanic_labels[:4]
print("Predictions:\t", prepare_select_and_predict_pipeline.predict(some_data))
print("Labels:\t\t", list(some_labels))

# Explore preparation options using GridSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'feature_selection__k': list(range(1, len(feature_importances) + 1))
    }
]

grid_search_prep = GridSearchCV(
    prepare_select_and_predict_pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    verbose=2
)

# grid_search_prep.fit(titanic, titanic_labels)
# joblib.dump(grid_search_prep, 'models/grid_search_prep.joblib')
grid_search_prep = joblib.load('models/grid_search_prep.joblib')
grid_search_prep

grid_search_prep.best_params_

rnd_search.best_params_

Final parameters for model

Using Random Forest Classifier:

Best Parameters: {'n_estimators': 100, 'max_features': 'log2', 'max_depth': 5, 'bootstrap': True}

{'feature_selection__k': 8} --> update k=8 in prepare_select_and_predict_pipeline

# Updating full pipeline
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, 8)),
    ('forest_classifier', RandomForestClassifier(**rnd_search.best_params_))
])
prepare_select_and_predict_pipeline.fit(titanic, titanic_labels)

Generate predictions on test set

predictions = prepare_select_and_predict_pipeline.predict(test)

# Ensure we have enough predictions
print(f"""
Shape of test set: {test.shape}
Number of predictions: {len(predictions)}
""")

# Combine PassengerId and predictions
predictions = pd.DataFrame(predictions, columns=['Survived'])
test_passenger_ids = test['PassengerId'].copy()
submission = pd.concat([test_passenger_ids, predictions], axis=1)

submission.head()

submission.shape

submission.info()

submission.to_csv('submission.csv', index=False)
