Intro
Breast cancer is a leading cause of cancer-related death in women. Accounting for over 30% of all female malignancies, it is the most prevalent cancer in women worldwide and is regarded as a complex disease: roughly 1.5 million women are diagnosed with breast cancer each year, and about 500,000 die from it. While the death rate has fallen over the previous 30 years, the disease has become more prevalent. Mammography screening is thought to reduce mortality by 20% and improve the effectiveness of cancer therapy by 60%; early detection, therefore, can save lives.
Aim and Objectives
The goal of this project is to determine when a cancer has the potential to cause harm, including death, and to deploy a machine learning model that predicts whether a tumour is benign or malignant based on the dataset provided.
A) Inference
The data gleaned was structured, consisting of 569 rows and 32 columns, including mean radius, mean texture, mean perimeter, mean area, mean smoothness and the diagnosis label.
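A minimal sketch of loading the data follows. It assumes the scikit-learn copy of the Wisconsin Diagnostic Breast Cancer dataset, which carries the same 569 rows and 30 measurement columns (the 32 columns in the original CSV also include an ID and the diagnosis label):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled Wisconsin Diagnostic Breast Cancer data
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target  # in scikit-learn: 0 = malignant, 1 = benign

print(df.shape)  # (569, 31): 30 measurement columns plus the diagnosis label
```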
Inference from Class Distribution
The imbalance ratio shows that the majority class, Benign, has 1.68 times as many instances as the minority class, Malignant.
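The ratio above can be reproduced directly from the class counts; this sketch again assumes the scikit-learn copy of the dataset (357 benign vs 212 malignant cases):

```python
from collections import Counter
from sklearn.datasets import load_breast_cancer

y = load_breast_cancer().target  # 1 = benign, 0 = malignant
counts = Counter(y)

# Imbalance ratio: majority (benign) over minority (malignant)
ratio = counts[1] / counts[0]
print(counts[1], counts[0], round(ratio, 2))  # 357 212 1.68
```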
Box Plot Inference
The boxplot analysis gives a good picture of the spread of the data, helps us assess the skewness of each specific parameter with respect to the diagnosis variable, and helps identify the outliers for the same.
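One such per-diagnosis boxplot can be sketched as below; the choice of "mean radius" is illustrative, and any of the measurement columns could be substituted:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target

# Boxplot of one feature, grouped by diagnosis, to show spread,
# skewness and outliers per class
fig, ax = plt.subplots()
df.boxplot(column="mean radius", by="diagnosis", ax=ax)
fig.savefig("boxplot_mean_radius.png")
```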
Pair plot Inference
Blue points represent malignant tumours and orange points represent benign tumours. A pairplot of selected relevant features is plotted, which visualises the pairwise relations between them.
Correlation Barplot
The correlation between the different variables and the target is shown. 1) There is a positive correlation between a benign diagnosis and 'smoothness_error'. 2) There is only a very weak positive correlation with 'fractal_dimension_mean', 'texture_error' and 'symmetry_error'. 3) All other factors show a negative correlation with a benign diagnosis (0).
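These per-feature correlations with the target can be computed as sketched below. Note that scikit-learn encodes benign as 1 (rather than 0 as in the CSV), and uses names such as "smoothness error" instead of 'smoothness_error':

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Correlation of each feature with the benign (=1) target
corr = df.corrwith(pd.Series(data.target)).sort_values()
print(corr)  # most size-related measurements correlate negatively with benign
```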
Inference
Data Normalisation
The main purpose of normalisation (Gaussian transformation) is to bring the values of the numeric columns in the dataset to a common scale without distorting differences in the ranges of values.
Why do we do it? - Some machine learning models, such as linear regression, assume that the data is normally distributed. - Otherwise, the data cannot be transformed using some of the Gaussian transformation techniques.
Histograms and Q-Q plots of the main features can be seen below. The features which require normalisation to fit a Gaussian distribution are converted using a logarithmic transformation.
Inference after Normalisation
After the logarithmic transformation, the Q-Q plots approximate a straight line; hence these variables are now approximately normally distributed.
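The transformation and before/after Q-Q plots can be sketched as follows; "mean area" is used here as an illustrative right-skewed feature:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# "mean area" is right-skewed; a log transform pulls it toward Gaussian
transformed = np.log1p(df["mean area"])

# Q-Q plots against the normal distribution, before and after
fig, (ax1, ax2) = plt.subplots(1, 2)
stats.probplot(df["mean area"], dist="norm", plot=ax1)
stats.probplot(transformed, dist="norm", plot=ax2)
fig.savefig("qq_mean_area.png")
```

After the transform, the points on the right-hand Q-Q plot lie much closer to a straight line, which is the visual check described above.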
F. Correlation Analysis
From the above analysis, the highly correlated features are:
- 'concavity_worst'
- 'area_worst'
- 'concave points_worst'
- 'perimeter_mean'
- 'concave points_mean'
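Highly correlated features of this kind can be flagged by scanning the upper triangle of the correlation matrix against a threshold; the 0.95 cutoff below is an assumption for illustration (scikit-learn's feature names use spaces, e.g. "mean perimeter" for 'perimeter_mean'):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Absolute correlation matrix, upper triangle only (avoids duplicates
# and the diagonal of self-correlations)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Features correlated above 0.95 with at least one earlier feature
high = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(high)
```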
G. Feature importance
The top 5 features for classification, according to the algorithm, are:
1. area_worst 2. area_mean 3. area_se 4. perimeter_worst 5. perimeter_mean
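Feature importances of this kind come from a fitted random forest, as sketched below. Note that the exact top-5 ordering depends on the hyperparameters, random seed and train/test split, so this sketch may not reproduce the list above exactly:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Fit a forest and read off impurity-based feature importances
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(data.data, data.target)

importances = pd.Series(clf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))
```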
Inference from the Random Forest Model
Out of 171 test cases, the model correctly predicted 162, comprising both true positives and true negatives; the remaining 9 cases were false negatives and false positives.
In summary, the model demonstrates an overall accuracy of about 95% (162/171), affirming its suitability for practical applications.
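The evaluation above can be reproduced along these lines; a 30% hold-out of the 569 rows yields a test set of 171 cases as in the counts above, though the exact confusion-matrix entries depend on the split and random seed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# 30% hold-out: 171 of the 569 cases go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(confusion_matrix(y_te, pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_te, pred))
```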
Conclusion
The ultimate aim of this EDA is to understand in depth the various parameters in the dataset that are involved in the diagnosis of breast cancer. The primary goal of the analysis was to identify the parameters that strongly correlate with one another. The analysis also gives a good sense of the patterns in the data, and of how well we can predict whether a case is benign or malignant if we fit a machine learning model.