This is a short introductory tutorial for beginners who want to get started with Kaggle. I'll be using the dataset from Kaggle's
Tabular Playground Feb 2021 competition. Instead of passively reading this notebook, I'd recommend
duplicating it and playing around with it in your own workspace.
Download the dataset from here and import it into your Deepnote workspace.
If you are new to Deepnote, check out this quick Getting-started guide.
Calling info() on a dataframe prints a short summary of it.
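As a minimal sketch (the tiny inline frame below just stands in for the competition's train.csv, which has the cat/cont/target columns):

```python
import io
import pandas as pd

# A tiny stand-in for the competition's train.csv -- the real file has
# cat0-cat9, cont0-cont13 and a target column.
csv = io.StringIO("cat0,cont0,target\nA,0.5,7.2\nB,1.3,6.8\n")
df = pd.read_csv(csv)

# info() lists each column's dtype and non-null count, which quickly
# reveals missing values and mis-typed columns.
df.info()
```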
As you can see above, there are 25 columns in total, broken down as follows:
cat0–cat9 => 10 categorical columns (values that can be categorized)
cont0–cont13 => 14 continuous columns (numerical values)
target => 1 target column
sample_submission.csv tells us what the submission file for this competition should look like. As you can see, we need to predict a single numerical target value.
For more details, be sure to read about the evaluation criteria for this competition.
Encoding non-numerical data
Now, computers cannot really make sense of categorical features like "Cats" and "Dogs"; to a model, they are just strings. Therefore, we need to encode the categorical features as numbers to make them useful.
This encoding can be performed using one of scikit-learn's encoders.
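A minimal sketch using scikit-learn's LabelEncoder (one option among scikit-learn's encoders; the toy frame below stands in for the competition data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the competition's categorical columns.
df = pd.DataFrame({"cat0": ["Cats", "Dogs", "Cats"], "cont0": [0.1, 0.2, 0.3]})

# LabelEncoder maps each distinct string to an integer code
# (classes are sorted, so "Cats" -> 0, "Dogs" -> 1).
le = LabelEncoder()
df["cat0"] = le.fit_transform(df["cat0"])
print(df["cat0"].tolist())  # [0, 1, 0]
```

In practice you would loop over all ten cat columns, fitting one encoder per column.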
Establish performance baseline
We need a baseline to assess our trained model's performance. For this, we will train a naive model that just guesses the target value. If our trained model performs worse than the naive model, then something is wrong with our implementation.
For the sake of comparison, let's see how the naive model performs when we set it to predict the
mean value instead.
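One way to build such a naive baseline is scikit-learn's DummyRegressor (an assumption on my part; the synthetic arrays below stand in for the real features and target):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # stand-in features
y = rng.normal(loc=7.0, size=100)    # stand-in target

# strategy="mean" always predicts the training target's mean;
# strategy="median" is the other common naive choice.
naive = DummyRegressor(strategy="mean").fit(X, y)
rmse = np.sqrt(mean_squared_error(y, naive.predict(X)))
print(f"naive RMSE: {rmse:.3f}")
```

Any real model should beat this number; if it doesn't, revisit the implementation.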
Now we will try to fit various types of models to the data.
Linear models: A linear model can only learn linear relationships between the features in the data. If there are any complex non-linear relationships, a linear model will not be able to make sense of them.
Non-linear models such as Random Forests: Quite self-explanatory. These types of models can build a non-linear understanding of the data.
The RMSE (root mean squared error) is lower than the naive model's, which implies that our linear model is working fine.
Note: Lower error => Better performance
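A minimal sketch of fitting a linear model and computing RMSE on a held-out split (synthetic data here; in the notebook you'd use the encoded competition features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise, standing in for the real frame.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# RMSE = sqrt(mean((y_true - y_pred)^2)); lower is better.
rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(f"linear RMSE: {rmse:.3f}")
```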
Random Forests (Non-linear Model)
The error is lower than the linear model's, but not by much. Do we really need a non-linear model then? Probably not. Still, it's a good idea to test different numbers of estimators before reaching a final conclusion.
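A quick way to test different numbers of estimators is a small loop (sketch with synthetic non-linear data; swap in the real encoded features):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear data standing in for the competition frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Try a few forest sizes and compare validation RMSE.
for n in (10, 50, 100):
    rf = RandomForestRegressor(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
    print(f"n_estimators={n}: RMSE={rmse:.3f}")
```

More trees usually help a little at the cost of training time; the gains flatten out quickly.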
Now we need to train our model and predict the values for the
test dataset. Then, as per Kaggle's requirements, we will generate a
submission.csv file for submitting our results.
For the sake of simplicity, I'll be training a linear model, but you can train a
RandomForestRegressor if you wish.
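The steps above can be sketched as follows (the inline frames stand in for the real encoded train/test data, and I'm assuming the submission needs id and target columns, matching sample_submission.csv):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-ins for the encoded train/test frames; in the notebook these
# come from train.csv and test.csv after categorical encoding.
train = pd.DataFrame({"id": [0, 1, 2],
                      "cont0": [0.1, 0.5, 0.9],
                      "target": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"id": [3, 4], "cont0": [0.2, 0.8]})

# Fit on all training rows, then predict the test targets.
model = LinearRegression().fit(train[["cont0"]], train["target"])

# Kaggle expects the same layout as sample_submission.csv: id + target.
submission = pd.DataFrame({"id": test["id"],
                           "target": model.predict(test[["cont0"]])})
submission.to_csv("submission.csv", index=False)
```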
Now check your Deepnote workspace for the newly generated
submission.csv file. Download this file and submit it to Kaggle as shown below.
- 1) Click on