Titanic
This notebook uses the Titanic dataset from Kaggle. I'm building it after completing chapter 2 of Hands-On Machine Learning, which covers how to complete an end-to-end machine learning project.
Frame the problem
Objective: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Get the data
Datasets
1) train.csv: subset of passengers (n=891) for training.
The value in the Survived column indicates whether the passenger survived (survived=1, died=0).
2) test.csv: subset of passengers for testing (n=418).
3) gender_submission.csv: example to show how you should structure your predictions. It predicts that all female passengers survived, and all male passengers died.
Your submission file should contain a PassengerId column (containing the ID of each passenger from test.csv) and a Survived column, where survived=1 and died=0.
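As a rough sketch of that submission structure, the snippet below rebuilds the gender-based baseline on a few made-up rows. The toy rows and the variable names are assumptions for illustration; the column is spelled `PassengerId` in Kaggle's files, and the real test.csv has 418 passengers.

```python
import io
import pandas as pd

# Toy stand-in for test.csv (hypothetical rows, only the columns needed here).
test_csv = io.StringIO(
    "PassengerId,Sex\n"
    "892,male\n"
    "893,female\n"
    "894,male\n"
)
test_df = pd.read_csv(test_csv)

# Baseline used by gender_submission.csv: females survive (1), males die (0).
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": (test_df["Sex"] == "female").astype(int),
})

# Write only the two expected columns, without the DataFrame index.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

With the real data, `pd.read_csv("test.csv")` would replace the in-memory toy file, and `submission.to_csv("submission.csv", index=False)` would write the file to upload.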
Examine the data
Observations on the data:
We do not need to create a test set; one has already been provided for us: test.csv
Explore the data to gain insights
Potential ways to conduct exploratory data analysis:
It looks like Fare is positively correlated with Survived and Pclass is negatively correlated with it, and both correlations are strong relative to the other numerical attributes.
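A minimal sketch of how such correlations can be computed with pandas is shown below. The rows are invented toy values, not real Titanic records, so the printed numbers only illustrate the sign of each relationship.

```python
import pandas as pd

# Hypothetical toy rows, NOT real Titanic records.
df = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
    "Pclass":   [1, 1, 2, 3, 3, 3, 2, 1],
    "Fare":     [80.0, 60.0, 26.0, 8.0, 7.9, 7.2, 13.0, 52.0],
})

# Pearson correlation of every numerical column with the target.
corr_with_target = df.corr()["Survived"].sort_values(ascending=False)
print(corr_with_target)
```

On the real training set the same call would be applied to the numerical columns of train.csv.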
This visualization isn't very useful because the target attribute is binary, which makes it difficult to tease out any real pattern.
It looks like passengers with exactly one parent/child on board, or exactly one sibling/spouse on board, were more likely to survive. Maybe that's because smaller families were easier to fit onto the lifeboats, and if an entire family couldn't fit on a lifeboat, they didn't want to leave anyone behind (so they all stayed on the ship).
Attribute combinations
Although the correlation between Survived and Family is not as high as I thought it would be, it still seems like passengers whose families weren't too big (that is, not more than four parents/children/siblings/spouses) were more likely to survive. Is there a strong relationship between Family and other attributes that correlate strongly with Survived, like Fare or Pclass?
There is a slight positive correlation between Family and Fare, meaning passengers with higher fares tend to have larger families. Passengers with higher fares are also more likely to survive.
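The sketch below shows one way a Family attribute could be built and inspected. It assumes Family is the sum of Kaggle's SibSp and Parch columns, which is a guess at how the attribute was defined, and it uses invented toy rows rather than the real data.

```python
import pandas as pd

# Hypothetical toy rows, NOT real Titanic records.
df = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 4, 5, 0, 1],
    "Parch":    [0, 0, 1, 1, 2, 2, 0, 2],
    "Fare":     [7.9, 52.0, 26.0, 13.0, 31.3, 46.9, 8.0, 60.0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Assumed definition: relatives on board = siblings/spouses + parents/children.
df["Family"] = df["SibSp"] + df["Parch"]

# Survival rate for each family size.
survival_by_family = df.groupby("Family")["Survived"].mean()
print(survival_by_family)

# Linear correlation between the new attribute and Fare.
family_fare_corr = df["Family"].corr(df["Fare"])
print(family_fare_corr)
```

Grouping by family size gives a per-size survival rate, which is easier to read than a single correlation coefficient when the relationship is not linear.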
Is it possible to extract titles from passenger names and compare survival rates between different title classes?
We know that women were more likely to survive than men. However, it's interesting to look at the Title value of 'Master'. I'm assuming this is the one category of male passengers who were more likely to survive than not. 'Master' was a title used for young boys, so this may reflect children being given priority for the lifeboats rather than wealth alone.
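Title extraction can be sketched with a regular expression, assuming names follow the `Surname, Title. Given names` pattern used in the Name column. The handful of rows and the regex below are illustrative assumptions, not the notebook's original code.

```python
import pandas as pd

# A few sample rows in the "Surname, Title. Given names" format (illustrative only).
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "Survived": [0, 1, 1, 0],
})

# Capture the text between the first comma and the following period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Survival rate per title.
survival_by_title = df.groupby("Title")["Survived"].mean()
print(df[["Name", "Title"]])
print(survival_by_title)
```

Rare titles would probably need to be grouped into a shared category before comparing rates, since a rate computed from one or two passengers is not very informative.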
Prepare the data
Data cleaning
Attributes w/ missing data:
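One way to produce such a list is to count null values per column. The sketch below does this on invented toy rows, so the attributes and counts it prints are not the real dataset's.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, NOT real Titanic records.
df = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, np.nan],
    "Fare":     [7.25, 71.28, 7.92, 53.1],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column and keep only columns that have any.
missing_counts = df.isnull().sum()
missing_only = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_only)
```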
Create custom transformers
Transformations I want to accomplish:
Transformation pipelines
We need two separate pipelines, one for numerical columns and one for categorical columns.
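A minimal sketch of the two-pipeline setup with scikit-learn follows. The column choices, imputation strategies, and scaler are assumptions for illustration (the notebook's actual transformers are not shown), and the rows are toy values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with some missing values.
df = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0],
    "Fare":     [7.25, 71.28, 7.92, 53.1],
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

num_cols = ["Age", "Fare"]      # assumed numerical columns
cat_cols = ["Sex", "Embarked"]  # assumed categorical columns

# Numerical pipeline: fill gaps with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline: fill gaps with the most frequent value, then one-hot encode.
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own pipeline and concatenate the results.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

X_prepared = full_pipeline.fit_transform(df)
print(X_prepared.shape)
```

Fitting the full pipeline on the training data and calling only `transform` on test.csv keeps the imputation and scaling statistics from leaking test information into training.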
Final columns after full pipeline transformation:
Explore many different models
Different models to try: Logistic Regression, Decision Tree, Random Forest
Logistic regression:
It performs very well on the validation folds, which suggests we're not overfitting.
Decision tree classifier:
We're badly overfitting using a decision tree.
Random forest classifier:
Random forest is still overfitting, but not as badly as the decision tree, and it gets fairly close to the logistic regression mean cross-validation score of 82.6%.
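The comparison above amounts to checking each model's training accuracy against its cross-validation accuracy. The sketch below does this on a synthetic dataset from `make_classification`, which is an assumption standing in for the prepared Titanic features, so its scores will not match the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # Accuracy on held-out folds (cross_val_score fits clones of the model).
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    # Accuracy on the same data the model was trained on.
    train_acc = model.fit(X, y).score(X, y)
    results[name] = {"train": train_acc, "cv_mean": cv_scores.mean()}
    print(f"{name}: train={train_acc:.3f}, cv mean={cv_scores.mean():.3f}")
```

A large gap between the training score and the cross-validation mean is the sign of overfitting; an unconstrained decision tree typically scores close to 100% on its own training data.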
Fine-tune your models
Let's work with our Random Forest model going forward, since it was the most promising. Let's use Randomized Search to perform hyperparameter tuning.
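A randomized search over Random Forest hyperparameters might look like the sketch below. The parameter ranges, the number of iterations, and the synthetic data are all assumptions chosen to keep the example small; they are not the values used in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared training data (assumption).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

# Distributions to sample hyperparameters from (illustrative ranges).
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # folds used to score each combination
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the training data with the best combination found, ready to predict on the test set.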
Final parameters for model
Using Random Forest Classifier: