Statistical Methods for Data Science - Final Project
Ellen Björkegren, Group EB
1. Introduction
Breast cancer affects women all over the world every year. To diagnose breast cancer, doctors look at scans of breast masses. In a study done at Wisconsin University, measurements of these masses were taken from such scans and summarized into a data set, together with data on whether each patient had a malignant or benign tumor. This data can be used to create a classifier that tries to determine whether a patient has breast cancer without a doctor having to look at the images (1). A program would reasonably do this a lot faster, and if a good enough model is created from this data it might even be more accurate. This is what will be attempted in this study, using two different models. The first model is a Random Forests model, a classifier that is one of the most common machine learning models for classification. The Random Forests model is also similar to the Multisurface Method-Tree model that was used to select the features of the data set. This model will then be compared to a KNN classifier. The purpose of this comparison is to see if the simpler KNN model can compete with the more complex Random Forests classifier. If this is the case, the KNN model could be used to speed up classification, since it is a simpler model that should run faster. This would enable patients to get their results and be treated sooner, and would save time in the lab that could be spent on running other, maybe more complex, tests. Overall, using a classifier for diagnosis would also free doctors to spend time on things other than diagnosing test results by looking at the breast mass scans. The patients would benefit as well, since they would hopefully get their results faster even if many other patients are being diagnosed around the same time.
2. Data specification (2 pts)
2.1. Describe where you got the data set
The data in this study was collected by performing FNA scans of the breast mass of 569 women. This was done at Wisconsin University in 1995. From these scans, 10 different measurements, deemed relevant for the classification by a decision tree method called Multisurface Method-Tree, were collected for each nucleus in each mass. Then the average, standard error and worst value were calculated for each patient across all the nuclei in their breast mass, resulting in a total of 30 features. The worst value was calculated by taking the mean of the 3 largest values among the nuclei for each mass. The target variable is binary and categorical, and describes whether the observed breast mass is a benign or malignant tumor. All features are continuous variables (1).
2.2. Put a testing set aside and do not look at it before you test your model.
A testing set is created using 20 % of the total data.
2.3. Split the rest of the data into a training set and a validation set. Do they (i.e., training vs validation) follow the same distribution? Use a Q-Q plot to show their relations.
Then a validation set is created from the remaining training data, by again setting aside 20 %.
To see if the training sample and validation sample follow the same distribution we will look at the QQ plots of the first three variables in each sample.
Looking at the QQ-plots of these three variables, they seem to follow the same distribution, since the points approximately follow the 45 degree line.
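The splitting and the Q-Q comparison described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the feature values are synthetic placeholders rather than the study's data, and the helper name split is hypothetical. The Q-Q pairs are the same quantile levels evaluated in the training and the validation sample; plotting them against each other and comparing with the 45 degree line gives the Q-Q plot.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic stand-in for one feature column (569 values); NOT the study's data.
values = [rng.lognormvariate(0.0, 0.5) for _ in range(569)]


def split(data, holdout_fraction, rng):
    """Shuffle a copy of data and split off holdout_fraction of it."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# 20 % of all data is put aside as the test set.
train_full, test = split(values, 0.20, rng)
# 20 % of the remaining data becomes the validation set.
train, validation = split(train_full, 0.20, rng)

# Q-Q pairs: 19 cut points (5 %, 10 %, ..., 95 %) evaluated in both samples.
train_q = statistics.quantiles(train, n=20)
valid_q = statistics.quantiles(validation, n=20)
qq_pairs = list(zip(train_q, valid_q))

for t, v in qq_pairs:
    print(f"train quantile {t:.3f}   validation quantile {v:.3f}")
```

If the two samples follow the same distribution, the two numbers in each pair should be close to each other, which corresponds to points lying near the 45 degree line.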
3. Define a problem (1 pts)
In this study the data is used to try to create a classifier that can predict if a woman has breast cancer faster than a doctor can by looking at each scan. The two models used to do this have different benefits: one is more complex, and the other is less complex and should therefore run faster. Both these attributes will be valued, since both accuracy and speed of diagnosis are important to the patients. Accuracy is of course important so that the patient can be diagnosed correctly and treated, and time is important so that the patient does not have to wait for the results and can be treated in time to be able to get better. The best model could then be used to diagnose women with breast cancer both fast and correctly, while leaving the doctors time to treat them instead of having to diagnose the scans manually one by one.
4. Descriptive analysis (4 pts)
4.1. Show the histogram of some selected variables and describe your observation.
In the histograms above it is shown that the variables describing the same feature in different ways have slightly different distributions and different values. This could be used as a motivation to use all the variables in the classifier. A common theme is that the standard error variables seem to have less variance. Also all the variables are skewed to the left. Some of them such as texture_worst and texture_mean seem to have a shape similar to a bell shaped curve, but are again skewed which might indicate that they are not normal.
4.2. Show the dependence of some selected variables and describe your observation.
To calculate correlation with the target variable it has to be made into a numerical dummy variable. The label B which indicates a benign tumor is therefore relabeled as 1 and the label M indicating malignant is relabeled as 0.
First, the correlation of each variable with the target variable is calculated.
From these variables we will look at the dependence of the 10 variables that have the highest correlation with the dependent variable, which should have the most impact on the classification in the models.
To see the dependency of these variables a heat map is created.
In the heat map darker colors indicate high correlation. As can be seen in the heat map above, the highest correlation is with the dependent variable, which we also saw in the Figure above. There also seems to be some relatively high correlation between some of the explanatory variables. This will be taken into account later in the study, since this could affect the classification.
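The relabeling of the target and the ranking of features by correlation can be sketched as below. The rows are a tiny made-up table (the column names follow the data set, the values are illustrative only), and the helper pearson is a hypothetical name, so this is a sketch of the idea rather than the code used in the study.

```python
import math

# Made-up rows for illustration; NOT taken from the study's data.
rows = [
    {"diagnosis": "M", "radius_worst": 25.4, "texture_mean": 21.0, "smoothness_se": 0.006},
    {"diagnosis": "M", "radius_worst": 23.6, "texture_mean": 17.8, "smoothness_se": 0.005},
    {"diagnosis": "M", "radius_worst": 24.9, "texture_mean": 20.4, "smoothness_se": 0.007},
    {"diagnosis": "B", "radius_worst": 13.5, "texture_mean": 18.0, "smoothness_se": 0.006},
    {"diagnosis": "B", "radius_worst": 12.8, "texture_mean": 14.3, "smoothness_se": 0.005},
    {"diagnosis": "B", "radius_worst": 14.1, "texture_mean": 19.5, "smoothness_se": 0.007},
]

# Dummy coding as in the text: benign (B) = 1, malignant (M) = 0.
y = [1 if row["diagnosis"] == "B" else 0 for row in rows]


def pearson(a, b):
    """Sample Pearson correlation between two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((z - mean_b) ** 2 for z in b)
    return cov / math.sqrt(var_a * var_b)


features = [name for name in rows[0] if name != "diagnosis"]
correlations = {name: pearson([row[name] for row in rows], y) for name in features}

# Rank by absolute correlation, strongest first.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, value in ranked:
    print(f"{name:15s} {value:+.3f}")
```

With benign coded as 1, a negative correlation means that higher feature values go together with malignant tumors.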
4.3. Describe the data using its range, sample mean, sample standard deviation and some quantiles. Describe your observation.
Below the 10 variables that have the highest correlation with the target variable are described using their count, mean, standard deviation, minimum and maximum values as well as 25th, 50th, and 75th percentiles.
Most of the variables have lower values around 0 and 1 except for perimeter_worst, area_worst, and area_mean which have higher values in the hundreds and radius_worst which is in the 10s to 30s. For the y-variable the mean value of 0.64 shows that the data is not equally distributed between the labels, since all values are 1 or 0 and if they were equal the mean would be 0.5.
4.4. Choose a visualization method to explore the data set.
First we will look at the target variable.
In the figure above we can see that there are more patients with benign tumors than patients with malignant tumors in the training data. However this probably reflects the data the models will face in real life, since it seems reasonable that more tumors are benign than malignant. Because we want the model to work on real data the number of observations in each class will not be adjusted.
Next we will explore the selected feature variables using pair plots, to see if the data are well separated between the labels.
The data looks well separated for these variables, which is a good sign for classification since the values of these variables could then be used to determine whether an observation belongs to class Malignant (y=0) or class Benign (y=1). The color blue in these plots shows the malignant data and orange represents the benign data. It is also interesting to see that the malignant data seems to have the higher values across all feature variables.
4.5. Explain how your analysis relates to the objective of your project, i.e. why are these selected variables important?
Looking at the correlation matrix, these variables have a high correlation with the target variable y. If the variables are highly correlated with the y-variable, they will contribute more to the classification and it therefore seems reasonable to study them a bit more. It is also interesting to look at correlation amongst the feature variables, since this could have some effect on the classifier as well.
5. Probability distribution (5 pts)
5.1. Use probability distributions to describe some selected variables. State why they are interesting to look at and describe your observation.
In this part the training data is split into the two labels, to investigate the distribution of the different labels for the different variables. This is done for the 5 variables that had the highest correlation to the target variable, again because it would be reasonable to think that these will have the highest effect on the classification.
Variable 1: Concave points_worst
These two distributions look quite similar, and have somewhat of a bell shaped curve. However their mean values are different: on average the worst concave points value for patients with a malignant tumor (y=0) seems higher than that for patients with a benign tumor (y=1).
Variable 2: Concave points_mean
These two distributions are a bit more different, with the distribution for the benign data being more skewed to the left. The mean values are also different, such that the patients with malignant tumors (y=0) seem to have a higher mean value for their concave points than those with a benign tumor (y=1).
Variable 3: Perimeter_worst
The variable describing the perimeter of the breast mass has a quite different distribution for the two labels, where the malignant data is more skewed to the left and the benign data is more centered. The values for the benign data are also lower than those for the malignant data, in line with the pair plots. The benign data does have the bell shaped curve typical of a normal distribution.
Variable 4: Radius_worst
For the worst radius, the malignant data (y=0) seems to be centered around some very common values where most observations lie. The benign data is shaped more like a normal distribution, and has lower values than those of the malignant data.
Variable 5: Concavity_mean
For the mean value of the concavity of the breast mass the distributions are very different. For the malignant data (y=0) the distribution is more centered, with a tail to the right, but for the benign data (y=1) the distribution is very skewed to the left.
Overall the variables seem to have somewhat different distributions for the different labels. Some of them looked slightly normally distributed. This is therefore tested using QQ-plots.
For most variables the points form a straight line, which suggests a distribution from the Gaussian location-scale family. However, the line is not at the same angle as the 45 degree line, indicating that they are not in fact Gaussian. The variables radius_worst and perimeter_worst do not seem to follow a Gaussian distribution or a Gaussian location-scale family at all. Even the variables whose points form a line seem to be quite far from the Gaussian distribution, since the angle from the 45 degree line is quite large.
Since it does not seem like the variables clearly follow a Gaussian distribution, further investigation is done on their possible distributions. For this the Fitter package is used, which fits a set of candidate distributions to the data and compares how well each of them describes it. The variables' distributions are compared against some common distributions.
Variable 1: Concave points_worst
The malignant data for the concave points_worst variable seems to best fit the lognormal distribution and the benign data for the concave points_worst variable seems to best fit the normal distribution.
Variable 2: Concave points_mean
Both the malignant data and the benign data for the concave points_mean variable seem to best fit the lognormal distribution.
Variable 3: Perimeter_worst
The malignant data from the perimeter worst data seems to best fit the gamma distribution and the benign data for the same variable seems to best fit the normal distribution.
Variable 4: Radius_worst
Also for the radius worst variable the malignant data seems to best fit the gamma distribution and the benign data the normal.
Variable 5: Concavity_mean
The last variable is concavity mean, and here the malignant data seems to fit the chi2 distribution best and the benign data seems best described by the lognormal distribution.
Although none of these variables are perfectly fit to these distributions they seem reasonable to use to describe the data. Also it is interesting to see that for all variables but one the malignant and benign data follow different distributions. This could be something that the classifiers could later pick up on and use to classify the observations into their separate classes.
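The comparison of candidate distributions described above can be illustrated with a small sketch. It uses scipy directly as a stand-in for Fitter, on a synthetic right-skewed sample rather than the study's data. Each candidate is fitted by maximum likelihood and then scored by the sum of squared errors between the histogram density and the fitted density, which is assumed here to mirror the criterion Fitter ranks by.

```python
import math

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for one feature within one class.
sample = rng.lognormal(mean=0.0, sigma=0.4, size=400)

# The four distributions that appear in the results of this section.
candidates = {
    "norm": stats.norm,
    "lognorm": stats.lognorm,
    "gamma": stats.gamma,
    "chi2": stats.chi2,
}

# Histogram density that every fitted density is compared against.
density, edges = np.histogram(sample, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2

sse = {}
for name, dist in candidates.items():
    params = dist.fit(sample)                # maximum likelihood estimates
    fitted_pdf = dist.pdf(centers, *params)  # fitted density at the bin centers
    sse[name] = float(np.sum((density - fitted_pdf) ** 2))

# Smallest error first: the best fitting candidate under this criterion.
for name, error in sorted(sse.items(), key=lambda item: item[1]):
    print(f"{name:8s} sum of squared errors = {error:.4f}")
```

Running the same loop once on the malignant and once on the benign observations of a variable gives one ranking per label, which is how the two labels can end up with different best fitting distributions.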
5.2. What are the parameters in the distribution? Estimate these parameters.
The parameters for the 4 distributions that seem to fit the data reasonably well are:
Lognormal: standard deviation, location, scale
Normal: mean(location) and standard deviation(scale)
Gamma: alpha(shape), location, beta(scale)
Chi2: degrees of freedom, location, scale
Estimates of these parameters are calculated below based on the estimated distributions for each feature variable.
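As a hedged illustration of how such estimates can be obtained, the sketch below uses the fit method of scipy's distributions, which returns maximum likelihood estimates in the order shape (if any), location, scale. The two samples are synthetic placeholders, not the study's data, and the variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic placeholders for one feature, split by label.
benign_like = rng.normal(loc=0.07, scale=0.03, size=300)
malignant_like = 100 + rng.gamma(shape=9.0, scale=2.0, size=200)

# Normal: mean (location) and standard deviation (scale).
mean_hat, sd_hat = stats.norm.fit(benign_like)

# Gamma: alpha (shape), location, beta (scale).
alpha_hat, gamma_loc_hat, beta_hat = stats.gamma.fit(malignant_like)

# Lognormal: shape (standard deviation of the log), location, scale.
s_hat, lognorm_loc_hat, lognorm_scale_hat = stats.lognorm.fit(malignant_like)

print(f"normal:    mean={mean_hat:.4f}, sd={sd_hat:.4f}")
print(f"gamma:     alpha={alpha_hat:.3f}, loc={gamma_loc_hat:.3f}, beta={beta_hat:.3f}")
print(f"lognormal: s={s_hat:.3f}, loc={lognorm_loc_hat:.3f}, scale={lognorm_scale_hat:.3f}")
```

The chi2 distribution is handled the same way with stats.chi2.fit, which returns the degrees of freedom followed by location and scale.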
Variable 1: Concave points_worst
Variable 2: Concave points_mean
Variable 3: Perimeter_worst
Variable 4: Radius_worst
Variable 5: Concavity_mean
5.3. Use hypothesis testing to show some interesting conclusions.
Now that we have analyzed the features in the data that were highly correlated it would be interesting to also focus on the features with lower correlation to see if these features are well separated for the target variable. If this is the case, it would be reasonable to think that they might still contribute to the classification. To do this a hypothesis test is run to compare the mean value for the two different classes separately within the three lowest correlated variables.
To do this we consider two tests, the two sample t-test and Welch's t-test. The difference between these two is that the two sample t-test assumes equal variance for the two samples, while Welch's t-test does not.
Variance looks about the same, both in the histograms and the difference between the variance calculated in the sample is very small. Therefore a t-test should be reasonable to use for the variable smoothness_se.
Again the variance looks about the same and even though the difference between the variances calculated is somewhat bigger than for the previous variable this could reflect the higher values taken on by this variable. Therefore the two sample t-test will be used.
For the final variable the calculated difference is very small and so a t-test seems reasonable to use although the distributions look a bit different.
Although all feature variables that will be tested seem to have about the same variance we can not say for sure that making this assumption is correct since we have only looked at histograms and calculated variances for the sample we have, while the assumption is for the whole population.
The other two assumptions of a two sample t-test are:
-Independence between samples
-The samples follow a normal distribution
These are pretty strong assumptions. However from looking at the histograms above at least some of the distributions look a bit bell shaped, which is the typical shape of a normal distribution. For the independent assumption this might be reasonable to think since there are different patients in the two samples.
Therefore it should be possible to perform a two sample t-test on the three variables, even though the results might not be exact because some assumptions might not be fulfilled.
The hypotheses for the test are:
H0: mean_d0 = mean_d1 , H1: mean_d0 ≠ mean_d1
Where mean_d0 is the feature mean for the patients with malignant tumors and mean_d1 is the feature mean of the patients with benign tumors.
All p-values from the tests are high, indicating that the means do not seem to be very different. This could mean that these variables will not add very much information to the classifier, especially in the simpler KNN model. Therefore a separate KNN model will be created without these features to see if it performs better without them.
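The test described above can be sketched with scipy's ttest_ind, where equal_var=True gives the two sample t-test and equal_var=False gives Welch's t-test. The two samples below are synthetic stand-ins for one weakly correlated feature split by class, so the printed p-values say nothing about the study's data. The pooled statistic is also computed by hand to show what the equal variance assumption does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic stand-ins (not the study's data) for one feature in each class.
malignant = rng.normal(loc=0.0070, scale=0.0030, size=130)  # y = 0
benign = rng.normal(loc=0.0071, scale=0.0030, size=230)     # y = 1

# H0: mean_d0 = mean_d1, H1: mean_d0 != mean_d1 (two-sided).
student = stats.ttest_ind(malignant, benign, equal_var=True)   # pooled variance
welch = stats.ttest_ind(malignant, benign, equal_var=False)    # Welch's t-test

# The pooled-variance statistic by hand.
n0, n1 = len(malignant), len(benign)
v0, v1 = np.var(malignant, ddof=1), np.var(benign, ddof=1)
pooled_var = ((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)
t_manual = (malignant.mean() - benign.mean()) / np.sqrt(pooled_var * (1 / n0 + 1 / n1))

print(f"two sample t-test: t={student.statistic:.3f}, p={student.pvalue:.3f}")
print(f"Welch's t-test:    t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
print(f"manual pooled t:   t={t_manual:.3f}")
```

When the two sample variances are close, as argued for the three variables above, the two tests give nearly the same statistic and p-value.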
6. Predictive analysis (5 pts)
Before starting the predictive analysis the data is split into its y and X-variables.
6.1. KNN
About the model
KNN refers to K-Nearest Neighbor, and is a non-parametric classification model. The hyperparameter k is chosen as the number of neighboring observations to consider for classification, and each observation is then classified to the majority class among its neighbors (2).
What is the mathematical expression of the model?
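To calculate the distance between a new observation x_0 and each training observation x_i, the Euclidean distance over the p features is used:
$$d(x_0, x_i) = \sqrt{\sum_{m=1}^{p} (x_{0m} - x_{im})^2}$$
Then each observation gets assigned to the class where the probability is largest. The probability for each class j is calculated as the share of the k nearest neighbors, N_0, that belong to that class:
$$P(Y = j \mid X = x_0) = \frac{1}{k} \sum_{i \in N_0} I(y_i = j)$$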
Formula source: see source 2.
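To make the classification rule concrete, here is a minimal sketch of KNN for a single new observation, written from scratch on a handful of made-up points. The function name knn_class_probabilities is hypothetical and the code is not the sklearn implementation used in the study.

```python
import math
from collections import Counter


def knn_class_probabilities(x0, X_train, y_train, k):
    """Share of each class among the k training points closest to x0."""
    distances = [(math.dist(x0, x), label) for x, label in zip(X_train, y_train)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}


# Made-up training points with two features; 1 = benign, 0 = malignant.
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9)]
y_train = [1, 1, 1, 0, 0]

probs = knn_class_probabilities((2.2, 2.2), X_train, y_train, k=3)
prediction = max(probs, key=probs.get)  # class with the largest probability

print(probs)
print(prediction)
```

For the point (2.2, 2.2) the three nearest neighbors are the two malignant points and one benign point, so the estimated probabilities are 2/3 for class 0 and 1/3 for class 1, and the observation is assigned to class 0.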
What are the hyperparameters?
The only hyperparameter for KNN is the number of neighbors k (3).
What are the parameters?
KNN does not have any parameters, but sometimes k is referred to as a parameter instead of a hyperparameter (2).
A short description of how to estimate the parameters?
As stated above there are not really any parameters in KNN, but to choose a value for k one can simply test different values to see which works best. This is done for both the training data and the validation data, to see which k produces the best result for both.
Looking at the resulting plots, the model with the low correlated features (KNN1) and the one without them (KNN2) seem to be almost identical. The best result in both plots, for training and validation data, seems to be when using k=3. Therefore this value will be chosen for the comparison with the Random Forests model. Since the models seem to give the same results, a check is done to see which is fastest.
Although the first model contains more features and should be more complex and slower, it is shown above to run faster than the model with fewer features. Therefore the KNN1 model is the one that will be tested against the Random Forests model.
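The search over k can be sketched as follows. The sketch assumes that scikit-learn's bundled breast cancer data set (load_breast_cancer, where 0 = malignant and 1 = benign) is a reasonable stand-in for the data of this study, and the random_state values and the range of k are arbitrary choices for the example, so the resulting numbers need not match the plots discussed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled data set stands in for the study's data.
X, y = load_breast_cancer(return_X_y=True)

# 20 % test set, then 20 % of the rest as validation set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=0)

scores = {}
for k in range(1, 16, 2):  # odd k avoids tied votes between the two classes
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (model.score(X_train, y_train), model.score(X_val, y_val))

# Pick the k with the highest validation accuracy; the test set is not touched.
best_k = max(scores, key=lambda k: scores[k][1])

for k, (train_acc, val_acc) in scores.items():
    print(f"k={k:2d}  train accuracy={train_acc:.3f}  validation accuracy={val_acc:.3f}")
print("best k on validation data:", best_k)
```

Training accuracy alone is not a good guide here, since k=1 always classifies the training points perfectly; the validation accuracy is what separates the candidate values.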
6.2. Random Forests
About the model
Random Forests is a classification model that uses several decision trees to assign each observation to a class. The decision trees are made up of different nodes, where at each node the data is split into smaller and smaller subsets using conditions that are based on the importance of different features. This process stops either when there is no more gain to be made from further splitting or when a stopping criterion is reached. Random Forests creates several such trees and bases the classification on the most common classification for that observation among all trees. To produce the data for all trees bootstrapping is used, which is basically sampling with replacement from the sample to create a new data set for each tree (4).
What is the mathematical expression of the model?
In each decision tree in the Random Forest model, the Gini Importance of a node is calculated. The formula for Gini Importance is:
$$ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)}$$
where ni_j = the importance of node j, w_j = weighted number of samples reaching node j, C_j = the impurity value of node j, left(j) = child node from left split on node j, and right(j) = child node from right split on node j.
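This is then used to calculate the importance for each feature:
$$fi_i = \frac{\sum_{j:\ \text{node } j \text{ splits on feature } i} ni_j}{\sum_{k \in \text{all nodes}} ni_k}$$
where fi_i = the importance of feature i and ni_j = the importance of node j.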
In Random Forests several such trees are created with different feature importance values. The feature importance across all trees is then simply calculated by taking the average of a normalized version of the feature importance value, which lies between 0 and 1 (4):
$$RFfi_i = \frac{\sum_{j \in \text{all trees}} normfi_{ij}}{T}$$
where RFfi_i = the importance of feature i calculated from all trees in the Random Forest model, normfi_{ij} = the normalised feature importance for i in tree j and T = total number of trees.
Each observation then travels through a tree along the path given by its feature values, constantly being split from other observations. In the end the groups of observations are assigned to the different classes based on which class is most common in each subset (4).
Formula source: see source 4.
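A small worked example may help to see how the importance values are built up. The tree below is hypothetical: it has two internal nodes with made-up sample shares and impurities, and the importances of the second tree are invented as well. The helper names are chosen for this sketch only.

```python
# Hypothetical tree: every internal node stores the feature it splits on,
# its share of samples w and impurity C, and the same for its two children.
nodes = [
    {"feature": "A", "w": 1.0, "C": 0.50,
     "w_left": 0.6, "C_left": 0.20, "w_right": 0.4, "C_right": 0.10},
    {"feature": "B", "w": 0.6, "C": 0.20,
     "w_left": 0.3, "C_left": 0.00, "w_right": 0.3, "C_right": 0.10},
]


def node_importance(node):
    """Weighted impurity of the node minus that of its two children."""
    return (node["w"] * node["C"]
            - node["w_left"] * node["C_left"]
            - node["w_right"] * node["C_right"])


def feature_importances(nodes):
    """Importance of each feature: its nodes' share of all node importance."""
    importances = [node_importance(node) for node in nodes]
    total = sum(importances)
    result = {}
    for node, value in zip(nodes, importances):
        result[node["feature"]] = result.get(node["feature"], 0.0) + value / total
    return result


def normalise(fi):
    """Rescale importances so that they lie between 0 and 1 and sum to 1."""
    total = sum(fi.values())
    return {name: value / total for name, value in fi.items()}


tree_1 = normalise(feature_importances(nodes))
tree_2 = {"A": 0.6, "B": 0.4}  # made-up normalised importances of a second tree

trees = [tree_1, tree_2]
forest = {name: sum(tree[name] for tree in trees) / len(trees) for name in tree_1}

print(tree_1)
print(forest)
```

Here the root has importance 1.0 * 0.50 - 0.6 * 0.20 - 0.4 * 0.10 = 0.34 and the second node 0.09, so feature A gets 0.34 / 0.43 of the importance in the first tree before averaging over the two trees.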
What are the hyperparameters?
In the sklearn algorithm there are many hyperparameters that can be tuned to create different results. The most important hyperparameter is n_estimators, which is the total number of trees used in the model (5). Then there are a number of other hyperparameters that can be tuned, where the most common ones to tune seem to be:
-criterion: the function used to measure the quality of a split. The ones to choose from are Gini, entropy and log loss, where the Gini criterion is most commonly used.
-max_features: the maximum number of features to consider when looking at how to best split the data at a certain node.
-max_depth: the maximum depth of the tree. This acts as a stopping criterion: if the tree does not stop by itself because a split produces no gain, this will be the depth of the tree.
-min_samples_split: the minimum number of data points that have to reach a node for it to split the data.
-min_samples_leaf: the minimum number of observations that must end up in a node for it to be considered a leaf node.
-bootstrap: whether each tree is built on a bootstrap sample of the training data (True) or on the whole training data (False).
What are the parameters?
In Random Forests the hyperparameters are sometimes called parameters, but they are usually not separated into two distinct groups (5).
A short description of how to estimate the parameters?
To see which hyperparameters give the best result on the training data, we use a modified version of the code from source 6. The idea is to randomly select different parameter values from a range of selected common values, and then run these different combinations. The attribute best_params_ can then be used to see which parameter combination gave the best result.
Above the parameters that gave the best result on the training data are shown. However since these parameters are tuned to fit the training data, there might be some overfitting. Therefore a base model is tested against this model using the validation data, to see which model performs best on other data.
From these results the tuned model (tm) gives the exact same accuracy score as the base model (bm), so the base model will be used since it takes less time to run.
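The random search over hyperparameters can be sketched as below. This assumes that the search is done with scikit-learn's RandomizedSearchCV and again uses the bundled breast cancer data as a stand-in; the grid values, n_iter, cv and random_state are illustrative choices and not the ones used in the study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Assumption: the bundled data set stands in for the study's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative candidate values for the hyperparameters discussed above.
param_distributions = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of random combinations to try
    cv=3,            # 3-fold cross-validation on the training data
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)           # best combination among the sampled ones
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

Only the sampled combinations are compared, so a different random_state or a larger n_iter can return a different set of best parameters.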
6.4. Evaluate their performance. Which one do you prefer and why?
First a check is done to see the difference in time it takes to run the model, and as expected the KNN classifier runs faster than the Random Forests model. Both classifiers are then evaluated.
Looking at the confusion matrices above it can be observed that the KNN model classifies fewer patients correctly than the Random Forests model does. This can also be seen when looking at the accuracy, which is about 93% for the Random Forests model and only about 78% for the KNN model. The KNN model seems to be particularly bad at classifying malignant tumors, since the confusion matrix shows that only 19 out of the 33 malignant tumors were classified correctly. It also misclassifies more benign tumors than the Random Forests model does. In the Random Forests model almost all masses are classified correctly.
In these models y=0 indicates a malignant tumor and y=1 indicates a benign tumor. Treating malignant as the positive finding, a false positive in this case is when a mass is labeled as malignant (y=0) but really is benign (y=1). A false negative is then when a mass is labeled as benign (y=1) but really is malignant (y=0). The latter case is reasonably the more dangerous one, since it tells a sick patient that they are healthy. Telling a healthy patient that they are sick is not a good thing either, but it is very probable that this error can be discovered by running more tests, while the patient labeled as healthy who is really sick might be sent home. Since this error seems more severe it is interesting to look at the recall of the classifiers, where a high value indicates a low proportion of false negatives.
The recall for the Random Forest is higher than that of the KNN model, but the difference is not as great as it was when looking at accuracy.
To further evaluate the models, precision and F1 score are also calculated. Precision is a measure similar to recall, but it penalizes false positives instead of false negatives.
These numbers look very similar to the accuracy score seen previously, and once again the Random Forests model outperforms the KNN.
Finally the F1 score is calculated, which takes both precision and recall into consideration. The KNN performs better in the F1 score than for accuracy and precision, but the Random Forests model still performs better.
Therefore the best model across all scores seems to be the Random Forests model. Although the KNN is faster, it performs quite a lot worse than the Random Forests model and since these are medical tests being performed accuracy seems more important than running time.
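How the four scores relate to the confusion matrix can be shown with a short sketch. The labels below are made up for illustration, the function binary_metrics is a hypothetical helper, and malignant (y=0) is treated as the positive class, which is an assumption about how the scores should be read rather than a description of the code used in the study.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with `positive` as the positive class.

    Assumes at least one true and one predicted positive, so no division by zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 0 = malignant, 1 = benign.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

metrics = binary_metrics(y_true, y_pred, positive=0)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

In this toy example two of the four malignant cases are missed, so recall is 0.5 even though accuracy is 0.7. Note that sklearn's recall_score and precision_score use pos_label=1 by default, so with benign coded as 1 the scores for the malignant class require pos_label=0.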
6.5. Run the algorithm you prefer on the test data set and draw a conclusion.
When running the Random Forests classifier on the test data it performs very well, with an accuracy of about 93%. Only 5 masses are misclassified as benign and 3 as malignant. Recall is slightly higher than precision, and therefore F1 is in between.
Although the numbers above seem quite impressive since they are close to 100% it is important to consider that these are patients with diseases that are not being treated, and so even one error has big consequences.
7. Conclusion (1 pts)
The purpose of this study was to try to create a classifier that can diagnose women with breast cancer without the doctor having to diagnose the scans manually. Of the two models tested, the more complex Random Forests model performed better than the KNN model. The KNN model did run faster, but since it seems reasonable to value correctness over speed the Random Forests model would probably be the best to use. Even though the KNN ran faster than the Random Forests, both running times are less than a second for more than 300 patients and should thereby be a major improvement over a doctor having to diagnose each patient manually. For very big hospitals with many patients, however, it could be of interest to have a model that is slightly faster, since this would save time when applied across that many patients, so creating a faster KNN model with better accuracy might be of interest in a future study. One way to improve the KNN model could be to remove the features that are highly correlated amongst each other, since this might have an effect on the classifier. Another factor could be that the variables were chosen for the model using a tree based model, which could make these features biased to give better results in a tree based model, which Random Forests is.
A different suggestion for a future study could be to create another Random Forests model, but focus on maximizing the recall. This would be interesting since, as stated earlier, recall might actually be the most important evaluation metric, because maximizing it minimizes false negatives (in this study that is diagnosing a patient with a benign tumor when in fact it is malignant). Even if this would probably result in more false positives, these could be handled by running more tests on the people that were diagnosed with a malignant tumor, to make sure the first test performed by the classifier was actually correct. However, here consideration must be given to the patients, since even if it seems better to diagnose someone with cancer for them to later find out they do not have cancer than the other way around, this is a very hard and scary thing to go through for the person being misdiagnosed. One way to handle this could be to run the second tests mentioned above before letting the patient know of the result. Even though this leads to more testing, it is important to protect the patients. Also there are reasonably more patients with benign tumors, and so the testing time for these patients will still be less.
Overall more data could also be collected to try and make the errors even smaller. Since this is medical testing of cancer, the results are very sensitive to people, and errors in these tests can have very big consequences. Therefore even if the scores for the model were quite good they might have to be even better in order to actually be able to be used in real life. This new data could be other measurements of the breast masses that were not included in this study, but also other types of data could be tested to see if a classifier using such data works better. This could for example be raw image data from the scans.
Sources
4) https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
6) https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74