# Intro

In this notebook I'll apply different EDA (Exploratory Data Analysis) techniques on the Graduate Admission 2 data.

The goal with this dataset is to predict the *student's chance of admission* to a postgraduate education, given several *predictor* variables for the student.

# Import libraries

# Load data

There are two data files:

`Admission_Predict.csv`

`Admission_Predict_Ver1.1.csv`

We'll use the second one, since it contains more data points.
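The loading step might look like the sketch below. In the notebook it would be a plain `pd.read_csv("Admission_Predict_Ver1.1.csv")`; here a couple of inline rows stand in for the file so the snippet is self-contained (the trailing spaces in `LOR ` and `Chance of Admit ` mimic the whitespace issue cleaned up later).

```python
import io
import pandas as pd

# In the notebook: df = pd.read_csv("Admission_Predict_Ver1.1.csv")
# Inline rows stand in for the file here; note the trailing spaces in
# "LOR " and "Chance of Admit " (cleaned up in the preprocessing step).
sample_csv = io.StringIO(
    "Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR ,CGPA,Research,Chance of Admit \n"
    "1,337,118,4,4.5,4.5,9.65,1,0.92\n"
    "2,324,107,4,4.0,4.5,8.87,1,0.76\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)
```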

According to the dataset author on Kaggle, the columns in this data represent:

`GRE Score`
: The Graduate Record Examinations is a standardized test that is an admissions requirement for many graduate schools in the United States and Canada.

`TOEFL Score`
: Score in the TOEFL exam.

`University Rating`
: The student's undergraduate university ranking.

`SOP`
: Statement of Purpose strength.

`LOR`
: Letter of Recommendation strength.

`CGPA`
: Undergraduate GPA.

`Research`
: Whether the student has research experience or not.

`Chance of Admit`
: Admission chance.

# Getting to know the data

In this section, we'll take a quick look at the data to see how many rows there are, and whether there are any missing values, to decide what kind of preprocessing will be needed.

The dataset consists of 500 samples and 9 columns: 8 *predictors* and one *target* variable.

There are no missing values (which is a very good thing!), but some column names need to be cleaned, and the `Serial No.` column must be removed, as it has nothing to do with the student's overall admission chance.

Looking at the `dtypes`, it seems that all columns have the correct data type: discrete columns are `int64` and continuous ones are `float64`.
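The quick checks above can be sketched as follows (on a tiny stand-in frame with the same dtype layout as the real data; in the notebook they run on the loaded `df`):

```python
import pandas as pd

# Stand-in frame mirroring the real data's dtype layout (assumption).
df = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [337, 324, 316],   # discrete -> int64
    "CGPA": [9.65, 8.87, 8.00],     # continuous -> float64
})
print(df.shape)               # (rows, columns)
print(df.isna().sum().sum())  # total number of missing values
print(df.dtypes)
```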

# Data cleaning and Preprocessing

As stated in the previous section, only a few *cleaning* steps will be performed, mainly:

- Remove extra whitespace from column names.
- Drop the `Serial No.` column.
- Convert the `Research` column to `bool`.

Pandas has a great feature which allows us to apply multiple functions to the `DataFrame` in sequential order: the `pipe` method.

Here, I'll define separate functions for applying each processing step, and then call them using the `pipe` method.

Now, we plug them together:
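A minimal sketch of how this could look (function names are illustrative, not the notebook's; each cleaning step gets its own small function here):

```python
import pandas as pd

def strip_column_names(df: pd.DataFrame) -> pd.DataFrame:
    # Remove extra whitespace from column names, e.g. "LOR " -> "LOR".
    return df.rename(columns=str.strip)

def drop_serial_no(df: pd.DataFrame) -> pd.DataFrame:
    # Serial No. carries no information about admission chance.
    return df.drop(columns=["Serial No."])

def research_to_bool(df: pd.DataFrame) -> pd.DataFrame:
    # Research is stored as 0/1; make its binary nature explicit.
    return df.assign(Research=df["Research"].astype(bool))

# Tiny stand-in frame for the loaded data.
raw = pd.DataFrame({
    "Serial No.": [1, 2],
    "GRE Score": [337, 324],
    "LOR ": [4.5, 4.0],
    "Research": [1, 0],
})

clean = (raw
         .pipe(strip_column_names)
         .pipe(drop_serial_no)
         .pipe(research_to_bool))
print(clean.columns.tolist())
```

Each function takes a `DataFrame` and returns a new one, which is exactly the shape `pipe` expects, so the steps chain without intermediate variables.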

We *cleaned* the data with *clean* code!

# Exploratory Data Analysis (EDA)

In this section, we'll explore the data *visually* and summarize it using *descriptive statistics*.

To keep things simpler, we'll divide this section into three subsections:
1. Univariate analysis: we'll focus on one variable at a time and study its descriptive statistics with charts like bar charts, line charts, histograms, boxplots, etc., looking at how the variable is distributed and whether there is any *skewness* in the distribution.
2. Bivariate analysis: we'll study the relation between *two* variables, present statistics such as correlation and covariance, and use additional charts like scatterplots, as well as the `hue` parameter of the previous charts.
3. Multivariate analysis: we'll study the relation between three or more variables, using additional chart types such as pairplot.

## Univariate Analysis

Here we'll perform analysis on each variable individually, but depending on the variable type, different methods and visualizations will be used. The main types of variables are:

- Numerical: numerical variables measure things like counts, grades, etc., and they don't have a *finite* set of values. They can be divided into:
  - Continuous: continuous variables are continuous measurements, such as weight and height.
  - Discrete: discrete variables represent counts, such as the number of children in a family or the number of rooms in a house.
- Categorical: a categorical variable takes one of a limited set of values, and it can be further divided into:
  - Nominal: a nominal variable has a finite set of possible values with no ordering relation among them, like countries; for example, we can't say that `France` is higher than `Germany` (`France` > `Germany`), therefore there's no sense of ordering between the values of a nominal variable.
  - Ordinal: in contrast to a `Nominal` variable, an ordinal variable defines an ordering relation between the values, such as student performance in an exam, which can be `Bad`, `Good`, `Very Good`, and `Excellent` (there's an ordering relation among these values, and we can say that `Bad` is lower than `Good`: `Bad` < `Good`).
  - Binary: binary variables are a special case of nominal variables that have only *two* possible values, like admission status, which can be either `Accepted` or `Not Accepted`.


Let's see what the types of variables in our dataset are:

- Discrete: `GRE Score` and `TOEFL Score` are discrete variables.
- Continuous: `CGPA` and `Chance of Admit` are continuous variables.
- Ordinal: `University Rating`, `SOP` and `LOR` are ordinal variables.
- Binary: `Research` is a binary variable.

### `GRE Score`

The `GRE Score` is a discrete variable.
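The charts referred to below can be sketched like this; synthetic normal scores stand in for `df["GRE Score"]` here (the real column has mean ≈ 316 and std ≈ 11.2), so only the shape of the code is meant, not the exact figures:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook just call plt.show()
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Synthetic stand-in for df["GRE Score"].
gre = np.clip(rng.normal(316, 11.2, 500).round(), 290, 340)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(gre, bins=25)
ax_hist.set(title="GRE Score distribution", xlabel="GRE Score", ylabel="count")
ax_box.boxplot(gre, vert=False)
ax_box.set(title="GRE Score boxplot")
print(round(gre.mean()), round(gre.std(), 1))
```

The same histogram-plus-boxplot pairing is reused for the other numerical variables below.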

We can conclude the following from the previous charts:

- The GRE scores are *very close* to a normal distribution, with a small negative skewness (left skewed).
- The most common scores are between `310` and `325`.
- The average score is `316`, with a standard deviation of `11.2`.
- There are no outliers.

This variable doesn't need any further processing.

### `TOEFL Score`

The `TOEFL Score` is a discrete variable.

From the previous charts, we can conclude:

- TOEFL scores are also normally distributed, with a small positive (right) skewness.
- The average TOEFL score is `107`, with a standard deviation of `6`.
- The most common scores are `110`, `105`, `104` and `112`.
- There are no outliers.

The variable doesn't need any further processing.

### `University Rating`

The `University Rating` is an ordinal variable; it represents the student's undergraduate university ranking on a scale of 1-5.

We can see that the most common rating is in the middle: `3`. The chart shows that the ratings are distributed in a fashion similar to the normal distribution.
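For an ordinal variable like this, a count per level is the natural summary behind the bar chart (inline sample ratings stand in for the real column):

```python
import pandas as pd

# Toy ratings standing in for df["University Rating"].
rating = pd.Series([3, 2, 3, 4, 3, 5, 1, 3, 4, 2], name="University Rating")
counts = rating.value_counts().sort_index()  # counts per level, in rating order
print(counts)
```

`counts.plot.bar()` would then draw the bar chart directly from this series.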

### `SOP`

`SOP` stands for the strength of the *Statement of Purpose*, which is a necessary document for graduate applications. The values were (mostly) entered by the students, and it's on a scale of 1-5, so this is an ordinal variable.

Most students estimated the strength of their *Statement of Purpose* between `3` and `4`.

### `LOR`

`LOR` stands for the strength of the *Letter of Recommendation*. The values were (mostly) entered by the students, and it's on a scale of 1-5, so this is an ordinal variable.

Most of the students rated the strength of their *Letter of Recommendation* between `3` and `4`.

### `CGPA`

`CGPA` stands for the student's *cumulative grade point average*, which represents the average of the grade points the student obtained in all subjects.

It's a continuous variable, on a scale of 0-10.

As we can see, this variable is *very* close to a normal distribution, with a small negative (left) skewness, and there are no outliers.

### `Research`

The `Research` variable indicates whether the student has any research experience or not, so it's a `Binary` variable.

It would be better, though, to have a variable like `Research duration` expressing how long the student was involved in research activity.

From this plot, we can see that the number of students who have research experience is *almost* equal to the number of students who don't.
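That balance can be read straight off `value_counts()` (a tiny boolean sample stands in for the cleaned `Research` column):

```python
import pandas as pd

# Toy stand-in for the cleaned boolean Research column.
research = pd.Series([True, False, True, True, False, False, True, False],
                     name="Research")
counts = research.value_counts()
print(counts)  # the two groups are roughly the same size
```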

Later, we'll study the relation of this variable with other variables.

### `Chance of Admit`

Quoting the dataset author from this thread:

> chance of admit is a parameter that was asked to individuals (some values manually entered) before the results of the application

So this column is not an actual *probability of admission* estimated by the universities; rather, it's the students' own estimation of how likely they are to be admitted to the university.

The plot shows that most students estimated their chance of admission between `0.7` and `0.75`.

The distribution is *moderately* skewed to the left, with a negative skew value of `-0.29`.

There are also two outlier values at `0.34`.

## Bivariate Analysis

In this section, we'll focus on studying the relationship between two different variables, to answer questions like:

- What is the relation between variable `x` and variable `y`? Is it linear or non-linear?
- In the case of a linear relation, is it a positive or a negative one? And how *strong* is the relation?
- How does the distribution change when two variables are considered together?

### Correlation matrix

We'll start off by computing the correlation matrix using the `.corr` method of pandas, which computes the pairwise correlation of columns.

The method used for calculating the correlation between two variables `x` and `y` is the Pearson correlation coefficient.

The Pearson coefficient is a measure of the linear correlation between two variables `x` and `y`, and it takes values between `-1` and `+1`.

A negative value indicates a negative correlation (i.e. when one variable *increases*, the other *decreases*), and a positive value indicates the opposite (the two variables *increase*/*decrease* at the same time).

Here, we'll compute the correlations only for the `GRE Score`, `TOEFL Score` and `CGPA` variables, because they are *numeric* variables and weren't estimated by the students themselves.
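A sketch of the computation (six inline rows stand in for the 500-row frame, so the exact coefficients differ from the notebook's):

```python
import pandas as pd

# Inline sample standing in for the full cleaned frame.
df = pd.DataFrame({
    "GRE Score":   [337, 324, 316, 322, 314, 330],
    "TOEFL Score": [118, 107, 104, 110, 103, 115],
    "CGPA":        [9.65, 8.87, 8.00, 8.67, 8.21, 9.34],
})
corr = df.corr()  # Pearson by default; method="spearman"/"kendall" also exist
print(corr.round(2))
```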

We can see from the correlation matrix that the three variables have a strong positive correlation. We'll look closer at the relations between variables using scatter plots.

### Scatter plot

Scatter plots are a good way to show the spread of points for two variables `x` and `y`, to view the relation between the two variables (e.g. linear, non-linear, ...) and the trend of a linear relation (positive, negative).

An easy way to show multiple scatter plots in the same figure is using either `scatter_matrix` or `pairplot`.
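With pandas that could look like the following (same six-row stand-in frame as before):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook, plt.show() instead
import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.DataFrame({
    "GRE Score":   [337, 324, 316, 322, 314, 330],
    "TOEFL Score": [118, 107, 104, 110, 103, 115],
    "CGPA":        [9.65, 8.87, 8.00, 8.67, 8.21, 9.34],
})
# 3x3 grid: scatter plots off the diagonal, histograms on it.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
print(axes.shape)
```

With seaborn, `sns.pairplot(df)` produces the equivalent grid.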

It's evident from these plots that the relation between the variables is a positive linear one, with some outlier points, and they all show an upward trend.

Let's show the scatter for each two variables at a time:

#### `TOEFL Score` vs. `GRE Score`

#### `TOEFL Score` vs. `CGPA`

#### `GRE Score` vs. `CGPA`

From the previous three charts we can say that students who perform well on their `TOEFL` exams tend to also perform well on `GRE` exams, and they *mostly* have a high `GPA` (higher than 9).

### Bivariate distributions

Another way to study the relation between two variables is with 2D Histograms (distribution).

Just like the distributions we used in the **Univariate Analysis** section, we can show the distribution for two variables `x` and `y`, which gives us better insight into how much the values from the two variables overlap, and shows cluster regions in the 2D space.

Compared to scatter plots, 2D histograms are better at handling large amounts of data, as they use rectangular bins and count the number of points within each bin.

#### `TOEFL Score` vs. `GRE Score`
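A hedged sketch with matplotlib's `hist2d` (correlated synthetic scores stand in for the two columns, so the clusters below won't match exactly):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook, plt.show() instead
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Correlated synthetic stand-ins for the TOEFL and GRE columns.
toefl = rng.normal(107, 6, 500)
gre = 316 + 1.5 * (toefl - 107) + rng.normal(0, 5, 500)

fig, ax = plt.subplots()
counts, xedges, yedges, img = ax.hist2d(toefl, gre, bins=20)
ax.set(xlabel="TOEFL Score", ylabel="GRE Score")
fig.colorbar(img, ax=ax, label="students per bin")
print(int(counts.sum()))  # every point falls in exactly one bin
```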

We can see some *clusters* (regions) in this chart.

For example, there are two clusters of students who scored between `110` and `115` on the `TOEFL` exam and between `320` and `330` on the `GRE` exam. These two clusters account for about 100 students (20% of the total dataset).

#### `TOEFL Score` vs. `CGPA`

This chart shows that about `170` students have a `TOEFL` score in the range `[105-115]` and a `CGPA` in the range `[8-9]`.