# Introduction

This is a short introductory tutorial for beginners wanting to get started with Kaggle. I'll be using the dataset from Kaggle's `Tabular Playground Feb 2021` competition. Instead of passively reading this notebook, I'd recommend you `Duplicate` it and play around with it in your own workspace.

Download the dataset from here and import it into your Deepnote workspace.

If you are new to Deepnote, check out this quick Getting-started guide.

# Data Exploration

Calling `info()` on a DataFrame prints a short description of it.
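As a minimal sketch, here is `info()` on a tiny synthetic stand-in (in the notebook you would load the real file with `pd.read_csv("train.csv")` — the path is an assumption about where you placed the dataset):

```python
import io
import pandas as pd

# Inline stand-in with the same column style as the competition data.
# In practice: train = pd.read_csv("train.csv")
csv = io.StringIO("cat0,cont0,target\nA,0.5,7.2\nB,1.3,6.8\n")
train = pd.read_csv(csv)

# Prints the column names, non-null counts, and dtypes of each column.
train.info()
```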

As you can see above, there are a total of 25 columns, and the breakdown is as follows:

- `cat0`-`cat9` => 10 categorical columns (values that can be categorized)
- `cont0`-`cont13` => 14 continuous columns (numerical values)
- `target` => 1 target column

`sample_submission.csv` tells us what the submission file for this competition should look like. As you can see, we need to predict a single numerical target value.
For more details, be sure to read about the evaluation criteria for this competition.

# Encoding non-numerical data

Now, computers cannot really make sense of categorical features like "Cats" and "Dogs"; to them, they are just strings. Therefore, we need to encode the categorical features to make them useful.
This encoding can be performed using scikit-learn's `LabelEncoder()` class.
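A minimal sketch of encoding all the `cat*` columns in place, assuming the DataFrame is named `train` (the tiny frame below is a made-up stand-in for the competition data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for the competition data.
train = pd.DataFrame({
    "cat0": ["A", "B", "A", "C"],
    "cont0": [0.1, 0.5, 0.3, 0.9],
})

# Encode every categorical column: each distinct string is mapped
# to an integer (classes are sorted, so A -> 0, B -> 1, C -> 2).
cat_cols = [c for c in train.columns if c.startswith("cat")]
for col in cat_cols:
    train[col] = LabelEncoder().fit_transform(train[col])
```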

# Establish performance baseline

We need a baseline to assess our trained model's performance. For this, we will train a naive model that just guesses the target value. If our trained model performs worse than the naive model then something is wrong with our implementation.

For the sake of comparison, let's see how the naive model performs when we set it to predict the `mean` value instead of the `median`.
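One way to sketch such a baseline is scikit-learn's `DummyRegressor`, which always predicts a constant. The target values below are made up for illustration; in the notebook they come from the competition data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up target values standing in for the real data.
y_train = np.array([7.0, 8.0, 9.0, 10.0])
y_valid = np.array([7.5, 9.5])

# strategy="mean" always predicts the training mean;
# swap in strategy="median" to compare the two baselines.
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train), 1)), y_train)  # features are ignored

preds = baseline.predict(np.zeros((len(y_valid), 1)))
rmse = np.sqrt(mean_squared_error(y_valid, preds))
```

Any trained model should beat this baseline's RMSE; if it doesn't, something is wrong with the implementation.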

# Train models

Now we will try to fit various types of models to the data.

- `Linear Model`: A linear model can only learn linear relationships between the features in the data. If there are any complex non-linear relationships, a linear model will not be able to make sense of them.
- `Non-linear models` such as Random Forests: quite self-explanatory. These types of models can build a non-linear understanding of the data.

## Linear Model

The `RMSE` (Root Mean Squared Error) is lower than the naive model's, which implies that our linear model is working fine.

Note: Lower error => Better performance
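A sketch of fitting a linear model and computing the RMSE — the features and targets below are made up (a perfectly linear relationship) purely to show the mechanics:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up data standing in for the encoded training features.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1

model = LinearRegression().fit(X, y)
preds = model.predict(X)

# RMSE: square root of the mean squared error (lower is better).
rmse = np.sqrt(mean_squared_error(y, preds))
```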

## Random Forests (Non-linear Model)

The error is lower than the linear model's, but not by much. Do we really need a non-linear model then? Probably not. Still, it's a good idea to test different numbers of estimators before reaching a final conclusion.
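A minimal sketch of the Random Forest fit, on made-up non-linear data (a quadratic) to show the pattern a linear model could not capture; `n_estimators` is the knob worth tuning:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up non-linear data: y = x^2.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = X[:, 0] ** 2

# n_estimators sets how many decision trees are averaged together.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

pred = forest.predict([[2.0]])
```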

# Submitting Results

Now, we need to train our model and predict values for the `test` dataset. Then, as per Kaggle's requirements, we will generate a `submission.csv` file for submitting our results.

For the sake of simplicity, I'll be training a linear model, but you can train a `RandomForestRegressor` if you wish.
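Putting it together, a sketch of writing `submission.csv`. The arrays below are made-up stand-ins for the real train/test data, and the `id`/`target` column names are assumptions based on the sample submission — mirror whatever `sample_submission.csv` actually uses:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in data; in the notebook these come from train.csv / test.csv.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([1.0, 3.0, 5.0])  # exactly y = 2x + 1
X_test = np.array([[4.0], [5.0]])
test_ids = [0, 1]

model = LinearRegression().fit(X_train, y_train)

# One row per test id, matching the sample submission's layout.
submission = pd.DataFrame({"id": test_ids, "target": model.predict(X_test)})
submission.to_csv("submission.csv", index=False)
```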

Now check your Deepnote workspace for the newly generated `submission.csv` file. Download this file and submit it to Kaggle as shown below.

1. Click on the `Submit Predictions` button.