The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
The dataset used for this competition is synthetic but based on a real dataset (in this case, the actual Titanic data!) and was generated using a CTGAN. The statistical properties of this dataset are very similar to the original Titanic dataset, but there's no way to "cheat" by using public labels for predictions. How well does your model perform on truly private test labels?
Your task is to predict whether or not a passenger survived the sinking of the Synthanic (a synthetic, much larger dataset based on the actual Titanic dataset). For each PassengerId row in the test set, you must predict a 0 or 1 value for the Survived target.
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
Installing & Loading Packages
All the packages are preloaded.
install.packages(c('IRkernel','tidyverse','data.table','lightgbm','FNN'), repos='http://cran.rstudio.com/', dependencies=TRUE)
Loading Train & Test
The data has been split into two groups:
- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Synthanic.
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
Loading Data and Row bind
Using fread to load the train/test CSV files, then combining both datasets with bind_rows and separating them again later via a column called is_test.
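The loading step described above can be sketched as follows; the file paths are assumptions for illustration.

```r
library(data.table)  # fread
library(dplyr)       # bind_rows

train <- fread("train.csv")
test  <- fread("test.csv")

# Tag each set before binding so the rows can be separated again later
train$is_test <- 0
test$is_test  <- 1

# bind_rows fills the missing Survived column in the test rows with NA
combined <- bind_rows(train, test)
```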
Visualization of Data
Analysing Unique Values
Creating Multiple Features From Ticket
First splitting the Ticket column by "/" using separate, then further splitting the resulting parts by space, which creates 4 new features.
Ticket_11, Ticket_12, Ticket_21, Ticket_22 => New Features
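A minimal sketch of this splitting step with tidyr; `fill = "right"` and `extra = "merge"` are assumptions to handle tickets that lack a "/" or a space.

```r
library(tidyr)
library(dplyr)

combined <- combined %>%
  # Split Ticket on "/" into two parts
  separate(Ticket, into = c("Ticket_1", "Ticket_2"), sep = "/",
           fill = "right", extra = "merge", remove = FALSE) %>%
  # Split each part on the space character
  separate(Ticket_1, into = c("Ticket_11", "Ticket_12"), sep = " ",
           fill = "right", extra = "merge") %>%
  separate(Ticket_2, into = c("Ticket_21", "Ticket_22"), sep = " ",
           fill = "right", extra = "merge")
```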
Feature and Non-Feature Names
Creating the Test Dataset out of the Combined Dataset
- Dividing the train data into train and validation datasets
- Defining X and y for both train and validation
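The split can be sketched like this; the 80/20 ratio and seed are illustrative assumptions, and `features` is the vector of feature-column names defined in the previous step.

```r
library(dplyr)

train_full <- combined %>% filter(is_test == 0)

set.seed(42)  # illustrative seed for reproducibility
idx <- sample(nrow(train_full), 0.8 * nrow(train_full))
tr  <- train_full[idx, ]
val <- train_full[-idx, ]

# Feature matrices (X) and targets (y) for train and validation
X_train <- as.matrix(select(tr,  all_of(features))); y_train <- tr$Survived
X_val   <- as.matrix(select(val, all_of(features))); y_val   <- val$Survived
```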
Optimum Parameter Selection
1) Learning rate = 0.01
2) boosting = 'gbdt'
3) objective = 'binary'
Creating LightGBM Datasets for Both Train and Validation
- Training rounds = 5000
- Predicting on the validation data
- Creating a threshold function for the metrics and converting the output into binary
- Trying three thresholds (0.4, 0.7, 0.001) to find the best-performing one
- Displaying the best accuracy achieved and the best AUC score
- Predicting on the test data frame for competition submission.
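The steps above can be sketched as follows, assuming `X_train`/`y_train` and `X_val`/`y_val` from the split step; the early-stopping setting is an illustrative addition, not part of the original recipe.

```r
library(lightgbm)

dtrain <- lgb.Dataset(X_train, label = y_train)
dval   <- lgb.Dataset.create.valid(dtrain, X_val, label = y_val)

params <- list(objective = "binary", boosting = "gbdt",
               learning_rate = 0.01, metric = "auc")

model <- lgb.train(params, dtrain, nrounds = 5000,
                   valids = list(valid = dval),
                   early_stopping_rounds = 100, verbose = -1)

# Validation-set probabilities
pred_prob <- predict(model, X_val)

# Threshold function: convert probabilities to binary 0/1 predictions
to_binary <- function(p, threshold) as.integer(p >= threshold)

# Try the three thresholds and report validation accuracy for each
for (t in c(0.4, 0.7, 0.001)) {
  acc <- mean(to_binary(pred_prob, t) == y_val)
  cat(sprintf("threshold %.3f -> accuracy %.4f\n", t, acc))
}
```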
Training & Prediction
You should submit a csv file with exactly 100,000 rows plus a header row. Your submission will show an error if you have extra columns or extra rows.
The file should have exactly 2 columns:
- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
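A submission file in that shape can be written like this, assuming the `model`, `features`, `combined`, and `test` objects from the earlier steps; the 0.5 threshold here is illustrative.

```r
library(data.table)
library(dplyr)

# Test-set feature matrix from the combined data
X_test <- as.matrix(combined %>%
                      filter(is_test == 1) %>%
                      select(all_of(features)))

# Two columns only: PassengerId and the binary Survived prediction
submission <- data.frame(
  PassengerId = test$PassengerId,
  Survived    = as.integer(predict(model, X_test) >= 0.5)
)
fwrite(submission, "submission.csv")
```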