Graduate Admission

by Reslan Tinawi, Jan 14, 2021
  1. Intro
  2. Import libraries
  3. Load data
  4. Getting to know the data
  5. Data cleaning and Preprocessing
  6. Exploratory Data Analysis (EDA)
    1. Univariate Analysis
      1. `GRE Score`
      2. `TOEFL Score`
      3. `University Rating`
      4. `SOP`
      5. `LOR`
      6. `CGPA`
      7. `Research`
      8. `Chance of Admit`
    2. Bivariate Analysis
      1. Correlation matrix
      2. Scatter plot
      3. Bivariate distributions
      4. `Research`
      5. `University Rating`
    3. Multivariate Analysis
      1. Scatter matrix with `Research`
      2. Bivariate distribution with `University Rating`
      3. `Research` and `University Rating`

Intro

In this notebook I'll apply different EDA (Exploratory Data Analysis) techniques on the Graduate Admission 2 data.

The goal is to predict a student's chance of admission to a postgraduate program, given several predictor variables for that student.

Import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from scipy import stats

# set seaborn theme
sns.set_style(style="whitegrid")

Load data

There are two data files:

  • Admission_Predict.csv
  • Admission_Predict_Ver1.1.csv

We'll use the second file, since it contains more data points.

df = pd.read_csv("data/Admission_Predict_Ver1.1.csv")

According to the dataset author on Kaggle, the columns in this data represent:

  • GRE Score: The Graduate Record Examinations is a standardized test that is an admissions requirement for many graduate schools in the United States and Canada.
  • TOEFL Score: Score in TOEFL exam.
  • University Rating: Student undergraduate university ranking.
  • SOP: Statement of Purpose strength.
  • LOR: Letter of Recommendation strength.
  • CGPA: Undergraduate GPA.
  • Research: Whether student has research experience or not.
  • Chance of Admit: Admission chance.

Getting to know the data

In this section, we'll take a quick look at the data to see how many rows there are and whether there are any missing values, to decide what kind of preprocessing will be needed.

df.head()
df.columns
df.shape
df.isnull().sum()
df.dtypes

The dataset consists of 500 samples and 9 columns: 8 predictors and one target variable.

There are no missing values (which is a very good thing!), but some column names need to be cleaned, and the Serial No. must be removed, as it has nothing to do with the student's overall admission chance.

Looking at the dtypes, it seems that all columns have the correct data type: discrete columns are int64 and continuous columns are float64.
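A quick way to spot the stray whitespace in column names is to print each name's repr(), which makes trailing spaces visible. This is a standalone sketch on a toy frame that reproduces the problem columns, not the real data:

```python
import pandas as pd

# Toy frame reproducing the issue: two columns carry trailing whitespace
df = pd.DataFrame(columns=["GRE Score", "LOR ", "Chance of Admit "])

# repr() makes trailing whitespace visible
print([repr(c) for c in df.columns])

# Columns whose name changes after strip() need cleaning
dirty = [c for c in df.columns if c != c.strip()]
print(dirty)  # ['LOR ', 'Chance of Admit ']
```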

Data cleaning and Preprocessing

As stated in the previous section, only a few cleaning steps will be performed, mainly:

  • remove extra whitespace from column names.
  • drop Serial No. column
  • convert Research column to bool.
df.columns

Pandas has a great feature which allows us to apply multiple functions on the DataFrame in a sequential order: the pipe method.
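As a minimal sketch (with toy functions, not the ones defined below), pipe simply passes the DataFrame through each function in order, so df.pipe(f).pipe(g) is equivalent to g(f(df)):

```python
import pandas as pd

def add_one(temp_df):
    # add 1 to every value in column "x"
    return temp_df.assign(x=temp_df["x"] + 1)

def double(temp_df):
    # double every value in column "x"
    return temp_df.assign(x=temp_df["x"] * 2)

toy = pd.DataFrame({"x": [1, 2]})

piped = toy.pipe(add_one).pipe(double)
nested = double(add_one(toy))

print(piped["x"].tolist())   # [4, 6]
print(piped.equals(nested))  # True
```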

Here, I'll define two separate functions for applying each processing step, and then call them using the pipe function.

def read_data():
    temp_df = pd.read_csv("data/Admission_Predict_Ver1.1.csv")
    return temp_df


def normalize_column_names(temp_df):
    return temp_df.rename(
        columns={"LOR ": "LOR", "Chance of Admit ": "Chance of Admit"}
    )


def drop_noisy_columns(temp_df):
    return temp_df.drop(columns=["Serial No."])


def normalize_dtypes(temp_df):
    return temp_df.astype({"Research": bool, "University Rating": str})


def sort_uni_ranking(temp_df):
    return temp_df.sort_values(by="University Rating")

Now, we plug them together:

df = (
    read_data()
    .pipe(normalize_column_names)
    .pipe(drop_noisy_columns)
    .pipe(normalize_dtypes)
    .pipe(sort_uni_ranking)
)
df.columns
df.shape
df.dtypes

We cleaned the data with clean code!

Exploratory Data Analysis (EDA)

In this section, we'll explore the data visually and summarize it using descriptive statistic methods.

To keep things simple, we'll divide this section into three subsections:

  1. Univariate analysis: we'll focus on one variable at a time, studying its descriptive statistics and its distribution (including any skewness) with charts such as bar charts, line charts, histograms, and boxplots.
  2. Bivariate analysis: we'll study the relation between two variables, present statistics such as correlation and covariance, and use additional charts like the scatter plot, also making use of the hue parameter of the previous charts.
  3. Multivariate analysis: we'll study the relation between three or more variables, using additional chart types such as the pairplot.

Univariate Analysis

In this section, we'll perform analysis on each variable individually; the methods and visualizations used depend on the variable's type. The main types of variables are:

  • Numerical: numerical variables measure things like counts and grades; they don't have a finite set of values, and they can be divided into:
    • Continuous: continuous variables are continuous measurements, such as weight and height.
    • Discrete: discrete variables represent counts, such as the number of children in a family or the number of rooms in a house.
  • Categorical: a categorical variable takes one of a limited set of values, and it can be further divided into:
    • Nominal: a nominal variable has a finite set of possible values with no ordering relation among them, like countries: we can't say that France is higher than Germany (France > Germany), therefore there's no sense of ordering between the values of a nominal variable.
    • Ordinal: in contrast to a nominal variable, an ordinal variable defines an ordering relation between its values, such as a student's performance in an exam, which can be Bad, Good, Very Good, or Excellent (there's an ordering relation among these values, and we can say that Bad is lower than Good: Bad < Good).
    • Binary: binary variables are a special case of nominal variables with only two possible values, like an admission status that can be either Accepted or Not Accepted.
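These categories can be partly recovered programmatically from the pandas dtypes. A minimal sketch on a toy frame mirroring this dataset's columns (not the real data; ordinal vs. nominal still requires human judgment):

```python
import pandas as pd

toy = pd.DataFrame({
    "GRE Score": [320, 310],          # discrete: int64
    "CGPA": [8.5, 9.1],               # continuous: float64
    "University Rating": ["3", "4"],  # ordinal, stored as strings
    "Research": [True, False],        # binary: bool
})

# Split columns by their pandas dtype family
discrete = toy.select_dtypes(include="int64").columns.tolist()
continuous = toy.select_dtypes(include="float64").columns.tolist()
binary = toy.select_dtypes(include="bool").columns.tolist()

print(discrete, continuous, binary)
# ['GRE Score'] ['CGPA'] ['Research']
```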

resources:

  • Variable types and examples
  • What is the difference between ordinal, interval and ratio variables? Why should I care?

Let's see what types of variables are in our dataset:

df.describe()

  • Discrete: GRE Score and TOEFL Score are discrete variables.
  • Continuous: CGPA and Chance of Admit are continuous variables.
  • Ordinal: University Rating, SOP and LOR are ordinal variables.
  • Binary: Research is a binary variable.

GRE Score

The GRE Score is a discrete variable.

df["GRE Score"].describe()
print(df["GRE Score"].mode())
print(stats.skew(df["GRE Score"]))
px.histogram(df, x="GRE Score", nbins=20, marginal="box")
sns.displot(df, x="GRE Score", kind="hist", kde=True)
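To interpret the skewness statistic printed above: a positive value means the distribution has a longer right tail, a negative value a longer left tail, and a value near zero means it is roughly symmetric. A quick sketch on synthetic data (not the admissions data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Exponential samples pile up near zero with a long right tail
right_tailed = rng.exponential(scale=1.0, size=10_000)
left_tailed = -right_tailed  # mirror image: long left tail

print(stats.skew(right_tailed) > 0)  # True: right-skewed
print(stats.skew(left_tailed) < 0)   # True: left-skewed
```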