Machine Learning Analysis on Logistic Regression, Decision Tree, and Random Forests Using the Mobile Price Classification Dataset
Introduction
Background of the Study
Statement of the Problem
Significance of the Study
Conducting the study provides several benefits for the smartphone market, centered on its primary objective: determining the appropriate price range of a smartphone. Consumers who intend to purchase a phone, or who simply want to gauge its price, benefit because the project develops a system that uses several machine learning models to identify the appropriate price class of a phone given its specifications, which helps them weigh affordability and cost-efficiency. Manufacturers benefit as well, since the system can support informed decisions about positioning a smartphone in the market, in particular by estimating a competitive price for a product under development.
Mobile Price Classification Dataset
The dataset, entitled Mobile Price Classification, was created by Abhishek Sharma in 2018 and made publicly available on Kaggle. It contains information about smartphones released in 2017 or earlier, together with their corresponding specifications. Each row represents a single smartphone, and there are 3,000 samples in total, divided into a training set of 2,000 samples and a test set of 1,000 samples. However, only the training set is used in the methodology, since the rows of the test set are unlabeled. The training set has 21 columns: 20 features describing a product and the corresponding price range.
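A minimal sketch of how the training set could be loaded and split for the experiments that follow (the file name train.csv and the 70/30 split are assumptions, the latter inferred from the 600-sample test support in the classification reports below):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the labeled Kaggle training file (the provided test file is unlabeled, so it is not used).
df = pd.read_csv("train.csv")  # 2000 rows x 21 columns

# Separate the 20 specification columns from the price_range label (classes 0-3).
X = df.drop(columns=["price_range"])
y = df["price_range"]

# Hold out 30% of the samples (600 rows) for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)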
Features
Labels
Mobile Price Classification Dataset Class Distribution
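The class-distribution plot referenced by the heading above can be drawn with seaborn's countplot (a sketch, assuming the DataFrame df from the loading step):

import seaborn as sns
import matplotlib.pyplot as plt

# Number of samples in each price_range class (0-3); the four classes are roughly balanced.
sns.countplot(x="price_range", data=df)
plt.title("Mobile Price Classification Dataset Class Distribution")
plt.show()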
Methodology
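The data-type listing and DataFrame summary that follow can be generated with standard pandas inspection calls (a sketch, assuming the DataFrame df from the loading step above):

# Data type of each of the 21 columns.
print(df.dtypes)

# Per-column non-null counts, dtypes, and memory usage of the 2000 training rows.
df.info()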
battery_power int64
blue int64
clock_speed float64
dual_sim int64
fc int64
four_g int64
int_memory int64
m_dep float64
mobile_wt int64
n_cores int64
pc int64
px_height int64
px_width int64
ram int64
sc_h int64
sc_w int64
talk_time int64
three_g int64
touch_screen int64
wifi int64
price_range int64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 battery_power 2000 non-null int64
1 blue 2000 non-null int64
2 clock_speed 2000 non-null float64
3 dual_sim 2000 non-null int64
4 fc 2000 non-null int64
5 four_g 2000 non-null int64
6 int_memory 2000 non-null int64
7 m_dep 2000 non-null float64
8 mobile_wt 2000 non-null int64
9 n_cores 2000 non-null int64
10 pc 2000 non-null int64
11 px_height 2000 non-null int64
12 px_width 2000 non-null int64
13 ram 2000 non-null int64
14 sc_h 2000 non-null int64
15 sc_w 2000 non-null int64
16 talk_time 2000 non-null int64
17 three_g 2000 non-null int64
18 touch_screen 2000 non-null int64
19 wifi 2000 non-null int64
20 price_range 2000 non-null int64
dtypes: float64(2), int64(19)
memory usage: 328.2 KB
Exploring the other features
One approach to identifying the most important features is to use an ensemble learning method fundamentally based on decision trees called ExtraTreesClassifier. This model has an attribute called feature_importances_ that scores each feature according to its relationship with the target variable. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance.
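A minimal sketch of this feature-importance step (the random_state and the number of features printed are assumptions):

from sklearn.ensemble import ExtraTreesClassifier
import pandas as pd

# Fit the tree ensemble on all 20 features against price_range.
extra_trees = ExtraTreesClassifier(random_state=42)
extra_trees.fit(X, y)

# feature_importances_ holds the (normalized) Gini importance of each feature.
importances = pd.Series(extra_trees.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))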
In general, the proponents also want to explore the correlations among the different features, apart from the target variable, in order to know which variables are similar to one another. For this, a Seaborn correlation heatmap is useful since it shows at a glance which variables are correlated, to what degree, and in which direction.
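The heatmap can be drawn directly from the pandas correlation matrix (a sketch; figure size and colour map are assumptions):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between all 21 columns, including price_range.
corr = df.corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()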
Note: during training, scikit-learn raised a ConvergenceWarning because the lbfgs solver reached its iteration limit, which suggests scaling the data or increasing max_iter.
In order to visualize the performance measures of the logistic regression with regard to all the features, the classification report is printed to show the precision, recall, F1-score, and support for every target class, comparing the model's predictions to the actual target values. The test set yielded 63.5% accuracy for the model.
precision recall f1-score support
0 0.8028 0.7550 0.7782 151
1 0.5175 0.5068 0.5121 146
2 0.5130 0.5338 0.5232 148
3 0.7081 0.7355 0.7215 155
accuracy 0.6350 600
macro avg 0.6353 0.6328 0.6337 600
weighted avg 0.6374 0.6350 0.6359 600
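A sketch of how this baseline could be reproduced (scikit-learn defaults are assumed; the ConvergenceWarning noted above indicates that scaling the features or increasing max_iter may be needed):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Logistic regression on all 20 features with default hyperparameters.
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

print(classification_report(y_test, log_reg.predict(X_test), digits=4))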
The same lbfgs ConvergenceWarning was raised when the model was retrained on the selected features.
In order to visualize the performance measures of the logistic regression with regard to the selected features, the classification report is printed to show the precision, recall, F1-score, and support for every target class, comparing the model's predictions to the actual target values. The test set yielded 63.0% accuracy for the model.
precision recall f1-score support
0 0.8071 0.7483 0.7766 151
1 0.5139 0.5068 0.5103 146
2 0.5065 0.5270 0.5166 148
3 0.6975 0.7290 0.7129 155
accuracy 0.6300 600
macro avg 0.6313 0.6278 0.6291 600
weighted avg 0.6333 0.6300 0.6312 600
During the grid search over the logistic regression hyperparameters, scikit-learn warned that some cross-validation scores were non-finite (nan), which typically occurs when a parameter combination is incompatible with the chosen solver; the remaining mean cross-validation accuracies ranged from roughly 0.64 to 0.97. A ConvergenceWarning was also raised for the newton-cg solver. After tuning, the logistic regression reached 97.5% accuracy on the test set, as shown in the classification report below.
precision recall f1-score support
0 0.9933 0.9801 0.9867 151
1 0.9416 0.9932 0.9667 146
2 0.9928 0.9324 0.9617 148
3 0.9747 0.9935 0.9840 155
accuracy 0.9750 600
macro avg 0.9756 0.9748 0.9748 600
weighted avg 0.9758 0.9750 0.9750 600
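A sketch of the kind of grid search that produces the warnings and the tuned report above (the exact parameter grid is an assumption; combinations that a solver does not support, such as an l1 penalty with lbfgs or newton-cg, fail to fit and yield the nan scores mentioned in the warning):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Hypothetical hyperparameter grid; some penalty/solver pairs are invalid and score nan.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2", "l1"],
    "solver": ["lbfgs", "newton-cg"],
}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)
print(classification_report(y_test, grid.predict(X_test), digits=4))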
First, the researchers run a baseline Decision Tree model with default settings on the initial dataset.
From the results, it can be seen that the model achieves an accuracy of 81.83%. This score serves as the point of comparison for the model trained on the selected features.
precision recall f1-score support
0 0.9097 0.8675 0.8881 151
1 0.7421 0.8082 0.7738 146
2 0.7517 0.7365 0.7440 148
3 0.8750 0.8581 0.8664 155
accuracy 0.8183 600
macro avg 0.8196 0.8176 0.8181 600
weighted avg 0.8210 0.8183 0.8192 600
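A sketch of the baseline decision tree run (default hyperparameters; the random_state is an assumption added for reproducibility):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Decision tree with default settings on all 20 features.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print(classification_report(y_test, tree.predict(X_test), digits=4))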
The researchers now train the Decision Tree Classifier with only the selected features.
precision recall f1-score support
0 0.89 0.87 0.88 151
1 0.74 0.78 0.76 146
2 0.73 0.70 0.72 148
3 0.86 0.85 0.85 155
accuracy 0.80 600
macro avg 0.80 0.80 0.80 600
weighted avg 0.80 0.80 0.80 600
precision recall f1-score support
0 0.93 0.88 0.90 151
1 0.74 0.79 0.76 146
2 0.74 0.76 0.75 148
3 0.91 0.88 0.89 155
accuracy 0.83 600
macro avg 0.83 0.83 0.83 600
weighted avg 0.83 0.83 0.83 600
precision recall f1-score support
0 0.9351 0.9536 0.9443 151
1 0.8151 0.8151 0.8151 146
2 0.7891 0.7838 0.7864 148
3 0.9216 0.9097 0.9156 155
accuracy 0.8667 600
macro avg 0.8652 0.8655 0.8653 600
weighted avg 0.8664 0.8667 0.8665 600
In this section, the Random Forest model is trained using only the selected features.
precision recall f1-score support
0 0.9477 0.9603 0.9539 151
1 0.8366 0.8767 0.8562 146
2 0.8112 0.7838 0.7973 148
3 0.9139 0.8903 0.9020 155
accuracy 0.8783 600
macro avg 0.8774 0.8778 0.8773 600
weighted avg 0.8783 0.8783 0.8781 600
In order to visualize the performance measures of the Random Forest Classifier with regard to the selected features, the classification report is printed to show the precision, recall, F1-score, and support for every target class, comparing the model's predictions to the actual target values. After hyperparameter tuning, the model did improve, but not as significantly as the other models: there is only a 0.34% improvement from the model with default parameters to the model that used the best set of parameters found by the grid search.
precision recall f1-score support
0 0.9139 0.9388 0.9262 147
1 0.8904 0.8387 0.8638 155
2 0.8311 0.8367 0.8339 147
3 0.8903 0.9139 0.9020 151
accuracy 0.8817 600
macro avg 0.8814 0.8820 0.8815 600
weighted avg 0.8816 0.8817 0.8814 600
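A sketch of how the Random Forest tuning step might look (the parameter grid and the selected_features list are hypothetical; the actual subset comes from the feature-importance analysis above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Hypothetical subset of high-importance features; replace with the actual selection.
selected_features = ["ram", "battery_power", "px_height", "px_width"]

# Hypothetical grid over common Random Forest hyperparameters.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train[selected_features], y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test[selected_features]), digits=4))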