Intro
Breast cancer is a leading cause of cancer-related death in women. Accounting for over 30% of all female malignancies, it is the most prevalent cancer in women worldwide and is regarded as a complex disease: roughly 1.5 million women are diagnosed with breast cancer each year, and about 500,000 die from it. While the death rate has fallen over the previous 30 years, the disease has become more prevalent. Mammography screening is thought to reduce mortality by 20% and improve the effectiveness of cancer therapy by 60%; early detection, therefore, can save lives.
Aim and Objectives
The goal of this project is to determine when a cancer has the potential to cause harm, including death, and to deploy a machine learning model that predicts whether a tumour is benign or malignant based on the dataset provided.
A) Inference
The data gleaned was structured, consisting of 569 rows and 32 columns, including mean radius, mean texture, mean perimeter, mean area, mean smoothness and the diagnosis label.
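A minimal sketch of loading the data follows. It assumes the scikit-learn copy of the Wisconsin Diagnostic Breast Cancer dataset, which carries the same 569 rows and 30 measurement columns (the 32 columns in the original CSV also include an ID and the diagnosis label):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled Wisconsin Diagnostic Breast Cancer data
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target  # in scikit-learn: 0 = malignant, 1 = benign

print(df.shape)  # (569, 31): 30 measurement columns plus the diagnosis label
```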
Inference from Class Distribution
The imbalance ratio shows that the majority class, Benign, has 1.68 times as many instances as the minority class, Malignant.
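The ratio above can be reproduced directly from the class counts; this sketch again assumes the scikit-learn copy of the dataset (357 benign vs 212 malignant cases):

```python
from collections import Counter
from sklearn.datasets import load_breast_cancer

y = load_breast_cancer().target  # 1 = benign, 0 = malignant
counts = Counter(y)

# Imbalance ratio: majority (benign) over minority (malignant)
ratio = counts[1] / counts[0]
print(counts[1], counts[0], round(ratio, 2))  # 357 212 1.68
```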
Box Plot Inference
The boxplot analysis gives a good picture of the spread of the data, helps us assess the skewness of each specific parameter with respect to the diagnosis variable, and helps identify the outliers for the same.
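One such per-diagnosis boxplot can be sketched as below; the choice of "mean radius" is illustrative, and any of the measurement columns could be substituted:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target

# Boxplot of one feature, grouped by diagnosis, to show spread,
# skewness and outliers per class
fig, ax = plt.subplots()
df.boxplot(column="mean radius", by="diagnosis", ax=ax)
fig.savefig("boxplot_mean_radius.png")
```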
Pair plot Inference
Blue points represent malignant tumours and orange points represent benign tumours. A pairplot of selected relevant features is plotted, which visualises the pairwise relations between them.
Correlation Barplot
The correlation between the different variables and the target is shown. 1) There is a positive correlation between a benign diagnosis and 'smoothness_error'. 2) There is only a very weak positive correlation with 'fractal_dimension_mean', 'texture_error' and 'symmetry_error'. 3) All other factors show a negative correlation with a benign diagnosis (0).
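These per-feature correlations with the target can be computed as sketched below. Note that scikit-learn encodes benign as 1 (rather than 0 as in the CSV), and uses names such as "smoothness error" instead of 'smoothness_error':

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Correlation of each feature with the benign (=1) target
corr = df.corrwith(pd.Series(data.target)).sort_values()
print(corr)  # most size-related measurements correlate negatively with benign
```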
Inference
Data Normalisation
The main purpose of normalisation (Gaussian transformation) is to bring the values of the numeric columns in the dataset to a common scale without distorting differences in the ranges of values.
Why do we do it? - Some machine learning models, such as linear regression, assume that the data is normally distributed. - Otherwise, the data cannot be transformed using some of the Gaussian transformation techniques.
Histograms and Q-Q plots of the main features can be seen below. The features which require normalisation to fit a Gaussian distribution are converted using a logarithmic transformation.
Inference after Normalisation
After the logarithmic transformation, the Q-Q plots approximate a straight line; hence these variables are now approximately normally distributed.
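The transformation and before/after Q-Q plots can be sketched as follows; "mean area" is used here as an illustrative right-skewed feature:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# "mean area" is right-skewed; a log transform pulls it toward Gaussian
transformed = np.log1p(df["mean area"])

# Q-Q plots against the normal distribution, before and after
fig, (ax1, ax2) = plt.subplots(1, 2)
stats.probplot(df["mean area"], dist="norm", plot=ax1)
stats.probplot(transformed, dist="norm", plot=ax2)
fig.savefig("qq_mean_area.png")
```

After the transform, the points on the right-hand Q-Q plot lie much closer to a straight line, which is the visual check described above.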
F. Correlation Analysis
From the above analysis, the highly correlated features are:
- 'concavity_worst'
- 'area_worst'
- 'concave points_worst'
- 'perimeter_mean'
- 'concave points_mean'
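Highly correlated features of this kind can be flagged by scanning the upper triangle of the correlation matrix against a threshold; the 0.95 cutoff below is an assumption for illustration (scikit-learn's feature names use spaces, e.g. "mean perimeter" for 'perimeter_mean'):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Absolute correlation matrix, upper triangle only (avoids duplicates
# and the diagonal of self-correlations)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Features correlated above 0.95 with at least one earlier feature
high = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(high)
```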
G. Feature importance
The top 5 features for classification, according to the algorithm, are:
1. area_worst 2. area_mean 3. area_se 4. perimeter_worst 5. perimeter_mean
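Feature importances of this kind come from a fitted random forest, as sketched below. Note that the exact top-5 ordering depends on the hyperparameters, random seed and train/test split, so this sketch may not reproduce the list above exactly:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Fit a forest and read off impurity-based feature importances
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(data.data, data.target)

importances = pd.Series(clf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))
```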
Inference from the Random Forest Model
Out of 171 test cases, the model correctly predicted 162, comprising both true positives and true negatives; the remaining 9 cases were false negatives and false positives.
In summary, the model demonstrates an overall accuracy of about 95% (162/171), affirming its suitability for practical applications.
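The evaluation above can be reproduced along these lines; a 30% hold-out of the 569 rows yields a test set of 171 cases as in the counts above, though the exact confusion-matrix entries depend on the split and random seed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# 30% hold-out: 171 of the 569 cases go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(confusion_matrix(y_te, pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_te, pred))
```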
Conclusion
The ultimate aim of this EDA is to understand in depth the various parameters in the dataset that are involved in the diagnosis of breast cancer. The primary goal of the analysis was to identify the parameters that strongly correlate with one another. The analysis also gives a good sense of the patterns in the data, and of how well we can predict whether a case is benign or malignant if we fit a machine learning model.