Intro
In this notebook I'll apply different EDA (Exploratory Data Analysis) techniques to the Graduate Admission 2 data.
The goal with this dataset is to predict a student's chance of admission to a postgraduate program, given several predictor variables for the student.
Import libraries
Load data
There are two data files:
Admission_Predict.csv
Admission_Predict_Ver1.1.csv
I'll use the second one, since it contains more data points.
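Loading the file could look like the sketch below. The helper name load_admissions and the assumption that the CSV sits next to the notebook are mine, not part of the original; adjust the path to wherever the Kaggle files live.

```python
from pathlib import Path

import pandas as pd


def load_admissions(path):
    """Read one of the admission CSV files into a DataFrame."""
    return pd.read_csv(path)


# Assumed location: the CSV sits next to the notebook. Adjust as needed.
DATA_PATH = Path("Admission_Predict_Ver1.1.csv")

if DATA_PATH.exists():
    df = load_admissions(DATA_PATH)
    print(df.shape)  # (number of rows, number of columns)
```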
According to the dataset author on Kaggle, the columns in this data represent:
- GRE Score: The Graduate Record Examinations (GRE) is a standardized test that is an admissions requirement for many graduate schools in the United States and Canada.
- TOEFL Score: Score in the TOEFL exam.
- University Rating: Ranking of the student's undergraduate university.
- SOP: Statement of Purpose strength.
- LOR: Letter of Recommendation strength.
- CGPA: Undergraduate GPA.
- Research: Whether the student has research experience or not.
- Chance of Admit: Admission chance.
Getting to know the data
In this section, we'll take a quick look at the data to see how many rows there are and whether there are any missing values, so that we can decide what kind of preprocessing is needed.
The dataset consists of 500 samples and 9 columns: an identifier (Serial No.), 7 predictors, and one target variable.
There are no missing values (which is a very good thing!), but some column names need to be cleaned, and the Serial No. column must be removed, as it has nothing to do with the student's overall admission chance.
Looking at the dtypes, it seems that all columns have the correct data type: discrete columns are int64 and continuous columns are float64.
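The checks described in this section can be sketched as follows. The DataFrame here is a toy stand-in with made-up values, and the choice of which column name carries stray whitespace is an assumption for illustration; with the real file, df would come from the loading step.

```python
import pandas as pd

# Toy stand-in for the real file: made-up values, similar column names.
df = pd.DataFrame(
    {
        "Serial No.": [1, 2, 3],
        "GRE Score": [330, 315, 300],
        "CGPA": [9.1, 8.4, 7.9],
        "Research": [1, 0, 0],
        "Chance of Admit ": [0.90, 0.72, 0.55],  # trailing space is deliberate
    }
)

n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()  # count of missing values per column
untidy_names = [c for c in df.columns if c != c.strip()]

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(df.dtypes)
print("Names with stray whitespace:", untidy_names)
```

df.info() gives a similar overview (non-null counts and dtypes) in a single call.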
Data cleaning and Preprocessing
As stated in the previous section, only a little cleaning is needed, mainly:
- remove extra whitespace from column names.
- drop the Serial No. column.
- convert the Research column to bool.
Pandas has a great feature that lets us apply multiple functions to a DataFrame in sequential order: the pipe method.
Here, I'll define two separate functions for the processing steps, and then chain them with pipe.
Now, we plug them together:
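A minimal sketch of such a pipeline is shown here. The function names, the way the three cleaning steps are split across two functions, and the toy rows are all assumptions made for illustration, not the original implementation.

```python
import pandas as pd


def strip_column_names(df):
    """Return a copy whose column names have no leading/trailing whitespace."""
    return df.rename(columns=str.strip)


def drop_id_and_cast_research(df):
    """Drop the Serial No. column and store Research as bool."""
    return df.drop(columns="Serial No.").astype({"Research": bool})


# Toy rows with made-up values, standing in for the loaded DataFrame.
raw = pd.DataFrame(
    {
        "Serial No.": [1, 2],
        "LOR ": [4.5, 3.0],
        "Research": [1, 0],
        "Chance of Admit ": [0.92, 0.61],
    }
)

# Each step receives the output of the previous one; raw is left untouched.
clean = raw.pipe(strip_column_names).pipe(drop_id_and_cast_research)
print(clean.dtypes)
```

Because every function takes a DataFrame and returns a new one, steps can be added, removed, or reordered without touching the others.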
We cleaned the data with clean code!
Exploratory Data Analysis (EDA)
In this section, we'll explore the data visually and summarize it using descriptive statistics.
To keep things simpler, we'll divide this section into three subsections:
1. Univariate analysis: here we focus on one variable at a time. We study its descriptive statistics and its distribution, including any skewness, using charts such as bar charts, line charts, histograms, and boxplots.
2. Bivariate analysis: here we study the relation between two variables, using statistics such as correlation and covariance, charts such as scatterplots, and the hue parameter of the previous charts.
3. Multivariate analysis: here we study the relation between three or more variables, using additional chart types such as the pairplot.
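The statistics behind the three levels can be sketched with pandas alone. The values below are made up for illustration; only the column names follow the dataset description.

```python
import pandas as pd

# Made-up toy values; the column names follow the dataset description.
df = pd.DataFrame(
    {
        "GRE Score": [300, 310, 320, 330, 340],
        "CGPA": [7.5, 8.0, 8.6, 9.0, 9.6],
        "Chance of Admit": [0.50, 0.62, 0.74, 0.85, 0.95],
    }
)

# Univariate: summary statistics and skewness of a single column.
gre_summary = df["GRE Score"].describe()
gre_skew = df["GRE Score"].skew()

# Bivariate: correlation and covariance between two columns.
corr_gre_cgpa = df["GRE Score"].corr(df["CGPA"])
cov_gre_cgpa = df["GRE Score"].cov(df["CGPA"])

# Multivariate: the full pairwise correlation matrix.
corr_matrix = df.corr()

print(gre_summary)
print(corr_matrix)
```

The charts mentioned above would complement these numbers; for example, seaborn's pairplot draws every pairwise scatterplot at once and accepts a hue argument for coloring by a categorical column.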
Univariate Analysis
In this section, we'll analyze each variable individually. The methods and visualizations we use depend on the variable's type. The main types of variables are:
- Numerical: numerical variables measure things like counts, grades, etc. They are not restricted to a small fixed set of values, and they can be divided into:
  - Continuous: continuous variables are continuous measurements, such as weight or height.
  - Discrete: discrete variables represent counts, such as the number of children in a family or the number of rooms in a house.
- Categorical: a categorical variable takes one of a limited number of values, and it can be further divided into:
  - Nominal: a nominal variable has a finite set of possible values with no ordering relation among them, like countries. For example, we can't say that France is higher than Germany (France > Germany), so there's no sense of ordering between the values of a nominal variable.
  - Ordinal: in contrast to a nominal variable, an ordinal variable defines an ordering relation between its values, such as student performance in an exam, which can be Bad, Good, Very Good, or Excellent. There's an ordering relation among these values, and we can say that Bad is lower than Good (Bad < Good).
  - Binary: binary variables are a special case of nominal variables with only two possible values, like admission status, which can be either Accepted or Not Accepted.
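As a rough illustration of how these categorical types can be represented in pandas, the sketch below uses the examples from the list above. The variable names are hypothetical, and this is one possible encoding rather than the only one.

```python
import pandas as pd

# Nominal: categories without any order.
countries = pd.Series(["France", "Germany", "France"], dtype="category")

# Ordinal: categories with an explicit order, lowest first.
grade_dtype = pd.CategoricalDtype(
    categories=["Bad", "Good", "Very Good", "Excellent"], ordered=True
)
grades = pd.Series(["Good", "Bad", "Excellent"], dtype=grade_dtype)

# Order comparisons are only meaningful for the ordered type.
below_good = grades < "Good"
print(below_good.tolist())
print(grades.min(), grades.max())

# Binary: a nominal variable with exactly two values, here stored as bool.
accepted = pd.Series([True, False, True])
```

Trying the same comparison on the unordered countries series raises a TypeError, which mirrors the point that France > Germany has no meaning.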
Let's see which types of variables our dataset contains:
- Discrete: GRE Score and TOEFL Score are discrete variables.
- Continuous: CGPA and Chance of Admit are continuous variables.
- Ordinal: University Rating, SOP and LOR are ordinal variables.
- Binary: Research is a binary variable.
GRE Score
The GRE Score is a discrete variable.