# Descriptive Analytics

**Descriptive analytics** is a commonly used form of data analysis whereby historical data is collected, organised and then presented in a way that is easily understood. Descriptive analytics is focused only on what has already happened in a business and, unlike other methods of analysis, *it is not used to draw inferences or predictions from its findings*.

### Import modules

There are several modules for conducting descriptive analytics as follows:

**pandas**: used for data manipulation**numpy**: used for linear algebra calculation**matplotlib**: used for data visualization**seaborn**: used for data manipulation**plotnine**: used for data manipulation

**Note**: that there are more than one module used for making a visualization. The ** plotnine** can be your choice if you are familier with ggplot2 on R porgramming. It will create your visualization beautiful

Sometimes, our script will produce a warning, basically it is not an error. However, we can ignore it using the ** warnings** module as follows.

### Import data

Before doing an analysis, we need to import our dataset into Python. Pandas provides a function to easily import the data. It handles many kind of data type such as ** csv**,

**,**

`xlsx`

**,**

`txt`

**, etc. In this case, we will import an CSV data. So, we use**

`sql`

**method on**

`read_csv`

**module with spesific separator of**

`pandas`

*,*.

**Note**: locate the data a certain folder, in this case, the data is in ** data** folder

After importing the dataset into Python, the ** df** is now our dataset or dataframe (similar term). The datafame has a lot of methods that print a spesific output about the characteristic of dataframe. The method

**will print out all the column names.**

`columns`

#### Columns

Each row represents an employee; each column contains employee's attributes in HCM dataset. The dataset have the following attributes or features or columns:

**Employee_ID**: Employee unique ID**GENDER**: Employee gender:*M = Male; F = Female***ADMINS**: Employee's work regional**SECTIONS**: Employee's section**DEPARTMENT**: Employee's department**DIVISION**: Employee's division**GRP**: Employee's group (sub-directorate)**EGRP**: Employee's egroup**DIREKTORAT**: Employee's directorate**EDU_LVL**: Employee's education level**JOB_CATEGORY**: Employee's job category**BAND**: Employee's band**AGE**: Employee's age (as of 30th June 2020)**YOS**: Employee's year of service (as of 30th June 2020)**PA H1 2020**: Employee's performance report on H1 2020**RANKING PA H1 2020**: Employee's ranking based on performance report on H1 2020**PA H2 2019**: Employee's performance report on H2 2019**RANKING PA H2 2019**: Employee's ranking based on performance report on H2 2019**CA 2019**: Employee's competency attainment on 2019**Number of Trainings (Course) H12020**: Number of training course the employee participated on H1 2020**Number of Trainings (Course) H22019**: Number of training course the employee participated on H2 2019**Absence Jan-2020**: Number of days (weekdays) employee were absence/not clockin without any notification on Jan 2020**Absence Feb-2020**: Number of days (weekdays) employee were absence/not clockin without any notification on Feb 2020**Absence Mar-2020**: Number of days (weekdays) employee were absence/not clockin without any notification on Mar 2020**Absence Apr-2020**: Number of days (weekdays) employee were absence/not clockin without any notification on Apr 2020**Absence May-2020**: Number of days (weekdays) employee were absence/not clockin without any notification on May 2020**Absence Jun-2020**: Number of days (weekdays) employee were absence/not clockin without any notification on Jun 2020**Is Promoted 2020**: Flag wether the employee getting promotion on 2020 or not:*1 = promoted 0 = not promoted***Is Promoted 2019**: Flag wether the employee getting promotion on 2019 or not:*1 = promoted 0 = not promoted*

The command ** len(df)** will print out the number of rows while

**will print out the number of columns in the**

`len(df.columns)`

**dataframe. To make it easier to understand, we will print the dimension of the data as follows. Then**

`df`

**will print the first five rows of dataframe, but we can spesify the rows with argument**

`df.head()`

**. For instance:**

`n`

**will print the first three rows of dataframe.**

`df.head(n = 3)`

**Note**: dimension represent the number of rows and columns in the dataset. It's like a matrix with rows and columns.

#### Count the missing value on the dataframe

#### Scale measurement

The method ** info** will show us the metadata or information about the columns. It undirectly spesifies the scale measurement of a given columns in the dataframe. Bu it can be misleading. So, we must modify the scale measurement or column type based on column characteristic.

Note that in some cases, not all the data contains number is a numerical. The user maybe represent the categorical data using number, such as customer satisfaction. So, it can not be actually calculated using mathematics operation. Let's convert it to make our data analysis more valid.

In this case, the column ** Is Promoted 2019** and

**should be the categorical data or in pandas these are**

`Is Promoted 2020`

*object*type. So, our first task is to convert it to the correct types.

As mentioned in previous section, the categorical data can not be calculated using mathematics operation such as addition, subtraction, etc. However, it can be summarized using **frequency distribution**.

In statistics, a frequency distribution is a list, table or graph that displays the frequency of various outcomes in a sample.

The method ** select_dtypes('object')** will select only a certain columns with type is an

**object**.

The method

**will generate the unique values of the selected columns (categorical)**

`value_counts()`

#### Summary statistics for the numerical data

Numerical data, as its name suggests, represents numbers. Numerical data is information that is something that is measurable. The ** pandas** provides a module namely

**to give us the summary statistics of numerical data.**

`describe`

The question, what is a summary statistics? It is simply the information that gives a quick and simple description of the data. It can include `mean`

, `median`

, `mode`

, `minimum value`

, `maximum value`

, `range`

, `standard deviation`

, etc.

### Get Insights in Descriptive Analytics

To find the insights from the data. We will make a simulation. In this simulation, we will have some questions asked by stakeholders. And with the data, we will answer the questions.

#### 1 Correlation coefficient using Heatmap

The degree of association is measured by a correlation coefficient, denoted by **r**. It is sometimes called Pearson's correlation coefficient after its originator and is a measure of linear association.

When two sets of data are strongly linked together we say they have a **High Correlation**.

- Correlation is
**Positive**when the values increase together, and - Correlation is
**Negative**when one value decreases as the other increases

Correlation can have a value:

**1**is*a perfect positive correlation***0**is*no correlation (the values don't seem linked at all)***-1**is*a perfect negative correlation*

**Note**: a correlation is assumed to be linear (following a line).

**The correlation formula**

**What is the heatmap?**

A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heatmaps are used in various forms of analytics but are most commonly used to show user behaviour on specific webpages or webpage templates.

Explanation of arguments:

: spesify the figure size to 12 and 12`figsize = (12, 12)`

: round the correlation up or down to three decimal place`round(corrmat, 3)`

: spesify the color theme for the heatmap. There are a lot of color theme can be used`cmap = 'YlOrBr'`

**here**: choose the line width for heatmap`linewidths = 0.1`

: annotate the correlation coefficient to the heatmap`annot = True`

#### 2 Distribution of gender by education level

**The stacked bar plot** extends the standard bar plot from looking at numeric values across one categorical variable to two. Each bar in a standard bar plot is divided into a number of sub-bars stacked end to end, each one corresponding to a level of the second categorical variable.

**The percentage stacked bar plot**

Another common option for stacked bar charts is the percentage, or relative frequency, stacked bar chart. Here, each primary bar is scaled to have the same height, so that each sub-bar becomes a percentage contribution to the whole at each primary category level.

The pandas ** crosstab** function builds a cross-tabulation table that can show the frequency with which certain groups of data appear.

We will create a cross-tabulation table between ** GENDER** column with

**.**

`EDU_LVL`

: reset the index after running the`reset_index()`

method. It makes our data more beatiful`crosstab`

: remove the index names in dataframe`df_gender_edu.columns.name`

*df_gender_edu*

The ** pandas melt()** function is used to change the DataFrame format

**from wide to long**. It’s used to create a specific format of the DataFrame object where one or more columns work as identifiers. All the remaining columns are treated as values and unpivoted to the row axis and only two columns –

**variable**and

**value**.

**Explanation:**

We can use

to specify which columns we want to melt or stack into column.`value_vars`

So,=`value_vars`

. We use the`['D1','D2','D3','D4','S1','S2','S3','SD','SLTA']`

column to`GENDER`

.`id_vars`

: sort the`df_gender_edu_melt.sort_values('Total', ascending = False)`

dataframe based on the*df_gender_edu_melt***Total**column increasingly

In order to create a percentage stacked bar plot, we will order the **education level** column from **SD** to **S3**. It makes the data visualization more beautiful. Previously, we assign the **education level** column as a *category* type. It because the categorical column can not be ordered before being assigned as *category*.

The **plotnine** is an implementation of a *grammar of graphics* in Python, it is based on **ggplot2**. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot.

Plotting with a grammar is powerful, it makes custom (and otherwise complex) plots easy to think about and then create, while the simple plots remain simple.

**The pattern of ggplot2**

```
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
```

**Explanation**

: set the figure size of plotnine`plotnine.options.figure_size = (8, 4.8)`

: the first layer of ggplot. It takes the`ggplot(data = df_gender_edu_melt)`

data`df_gender_edu_melt`

: the layer to create a barplot`geom_bar()`

: set the`x = 'Education'`

*Education*column as the horizontal axis: set the`y = 'Total'`

*Total*column as the vertical axis: separate the`fill = 'Gender'`

*Total*column based on a*Gender*categories: an argument to create a percentage stacked bar plot`position = 'fill'`

: a layer to assign the plot title, subtitle, and caption`labs()`

: assign a plot title`title`

: assign a plot subtitle`subtitle`

: a layer to assign the horizontal axis title`xlab()`

: a layer to assign the vertical axis title`ylab()`

: order the vertical axis element (discrete)`scale_x_discrete()`

: a layer to color the plot and customize the legend`scale_fill_manual()`

: color code. It can be in hex format or plain format, for instance:`values()`

`white`

,`black`

,`red`

, etc: change the element names of legend`labels()`

: theme that is assigned to the plot. Find out a lot of ggplot theme`theme_minimal()`

**here**

#### 3 How much employee is promoted in 2020?

To find out the ratio between promoted employee and not promoted employee, we must create a frequency table using the *Is Promoted 2020* column. Then calculate the percentage using its total value.

We can rename the spesific column names using ** rename** method in pandas.

`df_promote.rename(columns = {'index':'Status', 'Is Promoted 2020':'Total'}, inplace = True)`

**Explanation**

: the data is modified in place, which means it will return nothing and the dataframe is now updated`inplace = True`

**Pie charts** are very popular for showing a compact overview of a `composition` or `comparison`. While they can be harder to read than column charts, they remain a popular choice for small datasets.

#### 4 Which gender that has the highest promoted employee percentage in 2020?

To find the gender which has the highest promoted employee percentage in 2020, we must create cross-tabulation table between **GENDER** and **Is Promoted 2020** column. It is similar to the previous section when we tried to create a cross-tabulation table.

We create a new colum to determine the percentage of promoted employee by the gender

**Explanation**

: rename the horizontal axis label`scale_x_discrete()`

#### 5 Which education level that has the highest promoted employee percentage in 2020?

To find the education which has the highest promoted employee percentage in 2020, we must create cross-tabulation table between **EDU_LVL** and **Is Promoted 2020** column. It is similar to the previous section when we tried to create a cross-tabulation table.

**Note**: there are many education level which don't have employee who is promoted in 2020. So, because of its range, we decide to filter the data. We take out the rows which *Promoted* is zero.

**SD**: take out!**D2**: take out!**D4**: take out!**S3**: take out!

**Explanation**

: the layer to create a barplot`geom_bar()`

: filter the fill color where it will be`fill = np.where(df_promoted_edu['Education'] == 'S1', '#981220', '#80797c')`

**#981220**if the education level is**S1**and**#80797c**for others

: a layer to add the label on a certain plot`geom_label()`

: set the`x = 'Education'`

*Education*column as the horizontal axis: set the`y = 'Promoted'`

*Promoted*column as the vertical axis: assign the`label = 'Promoted'`

*Promoted*column to the bar: the size of label box`size = 10`

: the vertical adjustment to nudge labels by`nudge_y = 0.05`

#### 6 Can you tell me about the distribution of employees age based on promoted status?

**The density plot** visualizes the distribution of data over a continuous interval. This plot is a variation of a **histogram** that uses **kernel smoothing** to plot values, allowing for smoother distributions by smoothing out the noise.

**Explanation**

: a layer to create density plot in ggplot2`geom_density()`

: create the density plot of`x = 'AGE'`

*AGE*column: separate the density plot based on`fill = 'Is Promoted 2020'`

*Is Promoted 2020*column: change the borderline color of density plot`color = 'white'`

: transparency of fill color in density plot`alpha = 0.85`

: a layer to color the plot and customize the legend`scale_fill_manual`

: change the legend title`name`

#### Box and Whisker plot

**Box and Whisker plot** is a standardized way of displaying the distribution of data based on a five number summary:

**Minimum value****First quartile (Q1)****Median****Third quartile (Q3)****Maximum value**

We need to have information on the variability or dispersion of the data. **A box and Whisker plot is a graph that gives us a good indication of how the values in the data are spread out**. Although boxplots may seem primitive in comparison to a histogram or density plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or datasets.

**Explanation**

: a layer to create box and whisker plot in ggplot2`geom_boxplot()`

: a categorical column to separate the distribution`x = 'Is Promoted 2020'`

: a numerical column to create a box plot`y = 'AGE'`

: give a color on categorical column to separate the distribution`fill = 'Is Promoted 2020'`

: a layer to flip cartesian coordinates so that horizontal becomes vertical, and vertical becomes horizontal`coord_flip`

#### 7 Is promotion is determined based on employees year of service?

#### 8 Whether promotions spread evenly across employees' ages and year of service

**A scatterplot** is a type of data visualization that shows the relationship between two numerical variables. Each point of the dataset gets plotted as a point whose (x, y) coordinates relates to its values for the two variables. The strength of the correlation can be determined by how closely packed the points are to each other on the graph. Points that end up far outside the general cluster of points are known as outliers.