From the output of the functions above, we can see that the dataset has no missing values.
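A minimal sketch of such a check (the file name `students.csv` and DataFrame name `df` are placeholders for the actual dataset):

```python
import pandas as pd

# Placeholder path; substitute the actual dataset file.
df = pd.read_csv("students.csv")

# Count missing values per column; all zeros confirms no imputation is needed.
print(df.isnull().sum())
print(df.info())
```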
The above plots are used for bivariate and multivariate analysis. Since we are already given the features to use, there is no need for feature selection or data augmentation.
Decision trees are prone to overfitting. With no depth limit, the model above can grow until every leaf node contains a single sample, which is why it overfits: the training accuracy is 1 for every cross-validation split, while the testing accuracy is noticeably lower, a clear sign of overfitting.
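A quick illustration of this train/test gap, using synthetic data as a stand-in for the actual features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; the notebook uses its own X, y.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With no depth or leaf-size limit, the tree grows until every leaf is pure.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy:", tree.score(X_test, y_test))     # noticeably lower -> overfitting
```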
By changing the criterion to entropy, the model generalizes better. Entropy usually produces a more balanced tree than Gini, though Gini is faster to compute. In addition, setting min_samples_leaf to 5 prevents the tree from creating leaf nodes with only one sample, which reduces overfitting and helps the model generalize.
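A sketch of the constrained tree, reusing the split from the previous snippet:

```python
from sklearn.tree import DecisionTreeClassifier

# Entropy criterion plus a minimum leaf size of 5 stops the tree from
# isolating single samples in their own leaves.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5,
                              random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```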
Logistic regression is used for classification problems, so the predicted probability is thresholded into the labels 0 and 1. Here it classifies the input vector into 0 (not continuing in the graduate program) or 1 (continuing in the graduate program). The model achieves 0.996 accuracy on the test data.
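A minimal sketch of this step, assuming the X_train/X_test split from above stands in for the actual graduate-program features:

```python
from sklearn.linear_model import LogisticRegression

# predict_proba gives the class-1 probability; predict applies the default
# 0.5 threshold to map it onto the labels 0 and 1.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = logreg.predict_proba(X_test)[:, 1]
labels = logreg.predict(X_test)
print("test accuracy:", logreg.score(X_test, y_test))
```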
A RandomForestClassifier is used to predict, from the columns 'c01', 'c02', ..., 'c10', 'academic', 'campus', and 'internship', whether a student with those features will get a placement (1) or not (0). The random forest model achieves 0.9865 accuracy on the test data.
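A sketch of this classifier; the column names follow the report, while the DataFrame name `df` and target column `placement` are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

features = [f"c{i:02d}" for i in range(1, 11)] + ["academic", "campus", "internship"]
X = df[features]
y = df["placement"]  # assumed target column name

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```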
The same random_state is used for every train/test split so that all the models are trained and tested on the same data and can be compared fairly. The SLR model built on the 'academic' feature gives the lowest RMSE and the highest R2 score, so it can be considered the best of these models: the low RMSE means it predicts most accurately, and the high R2 score means it explains more of the variance than the other two models.
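A minimal sketch of the SLR setup and the two metrics; the target column name `salary` is an assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Single-feature regression on 'academic'; the shared random_state keeps
# the split identical across the compared models.
X = df[["academic"]]
y = df["salary"]  # assumed target column name
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

slr = LinearRegression().fit(X_train, y_train)
pred = slr.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2:", r2_score(y_test, pred))
```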
The same random_state is used for every train/test split so that all the models are trained and tested on the same data and can be compared fairly. The polynomial regression model on the 'academic' feature gives the lowest RMSE and the highest R2 score, so it can be considered the best of these models: the low RMSE means it predicts most accurately, and the high R2 score means it explains more of the variance than the other two models. All the models use ridge regularization.
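A sketch of the polynomial model with ridge regularization, reusing the split from the SLR sketch; the degree and alpha are illustrative assumptions:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Polynomial expansion of 'academic' followed by a ridge-regularized fit.
poly_ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
poly_ridge.fit(X_train, y_train)
pred = poly_ridge.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2:", r2_score(y_test, pred))
```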
The MLR model gives an RMSE of 17307.67 and an R2 score of 0.86. It performs significantly better than the SLR and polynomial regression models, with a lower RMSE and a higher R2 score.
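A sketch of the multiple-feature variant under the same split; the feature list here is an assumption:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Multiple predictors instead of 'academic' alone, same random_state split.
X = df[["academic", "campus", "internship"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlr = LinearRegression().fit(X_train, y_train)
pred = mlr.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2:", r2_score(y_test, pred))
```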
According to both the silhouette score and the elbow method, 3 is the optimal number of clusters. The elbow forms at k = 3, and for k = 3 only one cluster contains points with a negative silhouette score, fewer than for the other values of k.
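A minimal sketch of both diagnostics, where `X` stands in for the feature matrix used for clustering:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method: inertia for each k; silhouette score over the same range.
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, "inertia:", km.inertia_,
          "silhouette:", silhouette_score(X, km.labels_))
```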
Models are trained for different parameter combinations by iterating over 10 epsilon values between 1 and 20 and min_samples values from 3 to 6. The best combination was eps = 1.0 with min_samples = 3, which gave the highest silhouette score of 0.30368. A finer grid of eps and min_samples values could be searched to find even better parameters.
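A sketch of this grid search, again with `X` as the clustering feature matrix:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# 10 eps values between 1 and 20, min_samples from 3 to 6; keep the
# combination with the highest silhouette score.
best = (None, None, -1.0)
for eps in np.linspace(1, 20, 10):
    for min_samples in range(3, 7):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Silhouette needs at least two distinct labels.
        if len(set(labels)) > 1:
            score = silhouette_score(X, labels)
            if score > best[2]:
                best = (eps, min_samples, score)
print("best eps, min_samples, silhouette:", best)
```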
With n_components=3, the three principal components explain about 47%, 31%, and 12% of the variance, respectively. Equivalently, asking PCA to retain 80% of the variance yields 3 components.
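A sketch of both ways of specifying this, with `X` as the input feature matrix:

```python
from sklearn.decomposition import PCA

# Fixed number of components: inspect the variance explained by each.
pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)  # roughly [0.47, 0.31, 0.12] per the report

# Variance-threshold form: keep enough components for 80% of the variance.
pca80 = PCA(n_components=0.80).fit(X)
print(pca80.n_components_)  # 3 components in this case
```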
Only 2 rules were found with more than 40% support and 90% confidence. The rule ['Data_Mining'] -> ['Machine_Learning'] has a confidence of 98% and a lift of 1.32: if a student takes Data Mining as an elective, it is highly probable that they will also take Machine Learning. The other rule, ['Data_Structures_and_Algorithms'] -> ['Python_for_Data_Science'], has a confidence of 100% and a lift of 1.32: if a student takes Data_Structures_and_Algorithms as an elective, it is highly probable that they will also take Python_for_Data_Science.
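A minimal sketch of mining these rules with mlxtend, assuming `baskets` is a one-hot encoded DataFrame of elective choices (one row per student, one boolean column per elective):

```python
from mlxtend.frequent_patterns import apriori, association_rules

# Frequent itemsets with at least 40% support, then rules with at least
# 90% confidence.
frequent = apriori(baskets, min_support=0.40, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.90)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```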