Task 0
We added requirement.txt with scikit-learn and graphviz,
then ran pip install -r requirement.txt in the terminal.
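As a sketch, the requirement.txt from Task 0 could contain the following (note that the pip package for sklearn is named scikit-learn; any version pins would be an assumption, so none are given):

```
# requirement.txt -- packages needed for the tasks below
scikit-learn
graphviz
```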
Task 1
Reading the data
Training the baseline classifier
Trying out some different classifiers
Tuning the hyperparameters
Tuning shows that max_depth = 7 gives us the best accuracy.
Final evaluation
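The Task 1 workflow above (baseline, candidate classifiers, tuning) can be sketched as follows. The dataset here is synthetic and the classifier choices are illustrative assumptions; the real task reads its own data:

```python
# Sketch of Task 1: a dummy baseline, a couple of candidate classifiers
# compared by cross-validation, and the tuned tree with max_depth = 7.
# make_classification stands in for the actual dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=1000),
    "tree(max_depth=7)": DecisionTreeClassifier(max_depth=7, random_state=0),
}
# mean 5-fold cross-validation accuracy for each model
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, score in scores.items():
    print(f"{name:20s} {score:.3f}")
```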
Task 2
Use the defined class TreeClassifier as your classifier
Also draw the tree
Tune the hyperparameter max_depth to get the best cross-validation performance
max_depth = 5 gives the highest mean cross-validation accuracy in range(2, 11), with a score of 0.9123529411764706.
Evaluate the classifier on the test set
The mean score on the test data is 0.8709712722298221.
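The Task 2 tuning-and-evaluation workflow can be sketched like this, with sklearn's DecisionTreeClassifier standing in for the provided TreeClassifier class and synthetic data standing in for the real dataset (both assumptions):

```python
# Sketch of Task 2: select max_depth in range(2, 11) by mean cross-validation
# accuracy, then retrain on the full training set and score once on the test
# set. DecisionTreeClassifier is a stand-in for the assignment's TreeClassifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=1)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=1)

best_depth, best_score = None, -1.0
for depth in range(2, 11):
    score = cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=1),
        Xtrain, ytrain, cv=5,
    ).mean()
    if score > best_score:
        best_depth, best_score = depth, score

clf = DecisionTreeClassifier(max_depth=best_depth, random_state=1).fit(Xtrain, ytrain)
test_score = clf.score(Xtest, ytest)
print(best_depth, round(best_score, 3), round(test_score, 3))
```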
Task 3
LinearRegression
Linear regression assumes a linear relationship between the input data and the output data: a linear combination of all input variables produces the output.
Ridge
Ridge regression is a regularized linear regression. It shrinks the coefficients of input variables that do not contribute much to the prediction task.
Lasso
Lasso is also a regularization model that makes use of shrinkage, but it uses L1 regularization, in contrast to Ridge, which uses L2; the L1 penalty can drive some coefficients exactly to zero.
DecisionTreeRegressor
Decision tree regression observes the features of an object and trains a tree-structured model that predicts a meaningful continuous output for future data.
RandomForestRegressor
A random forest regressor fits multiple decision trees on various sub-samples of the dataset and then averages their predictions to increase accuracy.
GradientBoostingRegressor
A gradient boosting regressor builds a predictive model from an ensemble of weak predictive models, typically shallow decision trees added in stages.
MLPRegressor
MLPRegressor implements a multi-layer perceptron (MLP) that trains using backpropagation. It uses no activation function in the output layer, and it supports multi-output regression, in which a sample can have more than one target.
To compare cross-validation scores, we replaced the dummy regressor with these seven more meaningful regressors. Not all of them scored significantly higher than the dummy baseline: the Gradient Boosting Regressor achieved a comparatively higher score than the other regressors, followed by the Random Forest Regressor. Lasso, Ridge, and Linear Regression performed quite similarly to each other and better than the dummy regressor, while the Decision Tree Regressor and MLP Regressor did not beat the dummy regressor.
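The comparison described above can be sketched as a cross-validation loop over all seven regressors plus the dummy baseline. Synthetic data stands in for the actual dataset, and default hyperparameters are an assumption:

```python
# Sketch of the Task 3 comparison: mean cross-validated R^2 for each regressor
# against a DummyRegressor baseline. make_regression is a stand-in dataset.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

models = {
    "dummy": DummyRegressor(),
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
    "mlp": MLPRegressor(max_iter=1000, random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
           for name, m in models.items()}
# print models from best to worst mean R^2
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {score:.3f}")
```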
Finally, we train the best regressor on the full training set and evaluate it.
Task 4
Step 1. Implementing the regression model
Step 2. Sanity check
generate such a dataset and plot it
What kind of decision tree would we want to describe this data?
The output, or target (yy), is continuous rather than categorical or discrete, so classifier algorithms such as DecisionTreeClassifier are not an option. Because the target takes continuous values, regression is our choice, and we think DecisionTreeRegressor is the better option.
Train your decision tree regressor algorithm then draw the tree
Select the tree depth according to your common sense
Does the result make sense?
What happens if we allow the tree depth to be a large number?
Step 3. Predicting apartment prices using decision tree regression
Please describe which tree depth gives the best result.
To find the best max_depth, we ran the algorithm in a loop and compared the scores.
Plot
Based on the plot, the score keeps improving as the depth increases beyond 7, and the graph shows that 17 is the best depth in range(2, 18).
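The depth search described above can be sketched as a loop over max_depth values compared by mean cross-validation score. Synthetic data stands in for the apartment-price dataset:

```python
# Sketch of the Step 3 depth search: score DecisionTreeRegressor for each
# max_depth in range(2, 18) by 5-fold cross-validation and keep the best.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=6, noise=5.0, random_state=2)

scores = {
    depth: cross_val_score(
        DecisionTreeRegressor(max_depth=depth, random_state=2), X, y, cv=5
    ).mean()
    for depth in range(2, 18)
}
best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```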
Evaluation score on the test set.
Step 4. Underfitting and overfitting
- draw a plot that shows the evaluation score on the training set and on the test set
- for different values of max_depth, ranging from 0 to 12
- do not use cross-validation this time
To evaluate the regression model we use the mean squared error (MSE), which is the sum of the squared prediction errors (real output minus predicted output) divided by the number of data points: MSE = (1/n) * sum_i (y_i - yhat_i)^2.
The plot shows the RMSE (the square root of the MSE) on both the training data and the test data for each max_depth.
We can see that the training RMSE keeps decreasing as max_depth increases, while the test RMSE decreases only up to a depth of 5 and then rises as the depth grows further. So we think the model starts to overfit when max_depth is greater than 5.
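The Step 4 experiment can be sketched as follows: compute train-set and test-set RMSE for a range of max_depth values without cross-validation. The synthetic sine dataset here is an assumption standing in for the real data (and max_depth must be at least 1 in sklearn, so the loop starts there):

```python
# Sketch of the underfitting/overfitting experiment: RMSE on the training set
# and on a held-out test set for increasing max_depth, no cross-validation.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

train_rmse, test_rmse = [], []
depths = range(1, 13)
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(Xtr, ytr)
    # RMSE = sqrt(MSE); computed manually to stay version-independent
    train_rmse.append(mean_squared_error(ytr, model.predict(Xtr)) ** 0.5)
    test_rmse.append(mean_squared_error(yte, model.predict(Xte)) ** 0.5)

# training error shrinks monotonically; test error bottoms out, then rises
print(round(min(test_rmse), 3), round(test_rmse[-1], 3))
```

Plotting train_rmse and test_rmse against depths (e.g. with matplotlib) reproduces the U-shaped test curve discussed above.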