Data Cleansing in Telco

Case

The management of a telecommunications (Telco) company wants to reduce customer churn by using machine learning. The data science team has therefore been asked to prepare the data and build a prediction model that determines whether a customer will churn.

As a data scientist, you are required to prepare data before modeling.

In this task, you will perform data preprocessing (data cleansing) on last month's data, namely June 2020. The steps are:

Find valid customer IDs (phone numbers)

Handle missing values

Handle outliers in each variable

Standardize the values of categorical variables

Data Source and Library

import pandas as pd

df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/dqlab_telco.csv')
print(df.head())

print(df.info())

Columns Definition :

The Number of Rows and Columns

print(df.shape)

The Number of Unique Customer IDs

print(df.customerID.nunique())

Let's Start Our Task

Filtering Customer IDs by Format

The length is 11-12 characters.

It consists of digits only; no other characters are allowed.

The first two digits are 45.

Because the IDs are stored in inconsistent formats, we first convert them to strings with astype( ), then use str.match( ) with a regular expression to keep only the IDs that meet these criteria. Finally, we use count( ) to count the remaining rows.

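# Flag IDs that start with 45 followed by 9-10 digits.
# Note: str.match anchors the pattern only at the start of the string.
df['valid_id'] = df['customerID'].astype(str).str.match(r'(45\d{9,10})')
# Keep only the valid rows, then drop the helper column.
df = (df[df['valid_id'] == True]).drop('valid_id', axis=1)
print('The Number of the Filtered CustomerID is', df['customerID'].count())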
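The pattern above is anchored only at the start of each string, so an ID longer than 12 digits would still pass. The following is a minimal sketch on made-up IDs (not taken from the dataset) comparing str.match with str.fullmatch, which requires the whole string to match. Whether the dataset actually contains over-long IDs is not something this article checks, and the row counts reported here were obtained with the original pattern.

```python
import pandas as pd

# Hypothetical IDs chosen to exercise each rule; they are not from the dataset.
ids = pd.Series([
    "45123456789",    # 11 digits, valid
    "451234567890",   # 12 digits, valid
    "4512345678901",  # 13 digits, too long
    "4512345678",     # 10 digits, too short
    "46123456789",    # wrong prefix
    "45A23456789",    # contains a letter
])

pattern = r"45\d{9,10}"

# str.match anchors only at the start, so the 13-digit ID matches on its first 12 characters.
prefix_ok = ids.str.match(pattern)

# str.fullmatch requires the entire string to match, enforcing the 11-12 length rule.
full_ok = ids.str.fullmatch(pattern)

print(pd.DataFrame({"id": ids, "match": prefix_ok, "fullmatch": full_ok}))
```

Only the 13-digit ID is treated differently by the two methods; every other row gets the same answer from both.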

Removing Duplicate Customer IDs

To keep the analysis accurate, we must make sure that no customer ID appears more than once. Duplicates arise either because the same row was inserted more than once with identical values in every column, or because the same customer was recorded in different data retrieval periods. We will use drop_duplicates( ) to remove them, and sort_values( ) so that the most recent record for each customer is the one that is kept.

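# Drop rows that are exact duplicates in every column (the result must be assigned back).
df = df.drop_duplicates()
# Sort by update period, newest first, and keep the latest row for each customer ID.
df = df.sort_values('UpdatedAt', ascending=False).drop_duplicates('customerID')
print('The Number of Customer ID without Duplication (distinct)', df['customerID'].count())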
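To see why sorting before dropping duplicates keeps the latest record, here is a small sketch on invented rows. The column names follow the article, but the values (including the numeric form of UpdatedAt) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records: customer 45000000001 appears in two update periods.
toy = pd.DataFrame({
    "customerID": [45000000001, 45000000002, 45000000001],
    "UpdatedAt": [202005, 202006, 202006],
    "tenure": [10, 3, 11],
})

# Sort newest first, then keep the first (newest) row for each customerID.
latest = (
    toy.sort_values("UpdatedAt", ascending=False)
       .drop_duplicates("customerID", keep="first")
       .sort_values("customerID")
       .reset_index(drop=True)
)
print(latest)
```

The older row for customer 45000000001 (tenure 10) is discarded and the June row (tenure 11) survives.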

Validating customer IDs is necessary to ensure that the data we use is correct. The results show how the number of IDs changes from the initial load to the final result: when first loaded, the data had 7113 rows and 22 columns with 7017 unique customer IDs. After the customer ID checks, 6993 rows remain.

Resolving Missing Values by Deleting Rows

Next, we delete the rows for which it is unknown whether the customer churned. We assume that the data modeller will only accept rows that have a churn flag. We will use isnull( ) to detect missing values and dropna( ) to remove them.

print('Total of Missing Values from Churn Columns', df['Churn'].isnull().sum())

# Drop all rows with a missing value in a specific column (Churn)
df.dropna(subset=['Churn'], inplace=True)
print('Total of Rows and Columns after Deleting Missing Values', df.shape)

Resolving Missing Values by Filling in Specific Values

We assume that the data modeller asks for the remaining missing values to be filled according to the following criteria:

Rows with a missing subscription length (tenure) are filled with 11. Missing values in the other numeric variables are filled with the median of each variable. The steps are:

Check whether any missing values remain

Count the missing values in each variable

Handle the missing values

print('Status Missing Values :', df.isnull().values.any())
print('\nNumber of missing values in each column:')
print(df.isnull().sum().sort_values(ascending=False))

# Fill missing tenure with 11 (assigning back avoids in-place fillna on a column selection)
df['tenure'] = df['tenure'].fillna(11)
# Fill missing values in the other numeric variables with each column's median
for col_name in ['MonthlyCharges', 'TotalCharges']:
    median = df[col_name].median()
    df[col_name] = df[col_name].fillna(median)
print('\nNumber of missing values after imputation:')
print(df.isnull().sum().sort_values(ascending=False))
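The two filling rules can be checked on a tiny table. This is a sketch with invented numbers, not the real data, to show what a constant fill and a median fill each produce.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one gap in each numeric column.
toy = pd.DataFrame({
    "tenure": [1.0, np.nan, 24.0, 60.0],
    "MonthlyCharges": [20.0, 50.0, np.nan, 80.0],
    "TotalCharges": [20.0, np.nan, 1200.0, 4800.0],
})

# Rule 1: a missing tenure is filled with the constant 11.
toy["tenure"] = toy["tenure"].fillna(11)

# Rule 2: other numeric columns are filled with their own median
# (median() skips missing values by default).
for col in ["MonthlyCharges", "TotalCharges"]:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The median is computed from the non-missing values only, so the MonthlyCharges gap becomes 50.0 and the TotalCharges gap becomes 1200.0.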

Further analysis shows that the data with validated customer IDs still contains missing values, in the Churn, tenure, MonthlyCharges and TotalCharges columns. After deleting rows and filling in specific values, no missing values remain: the missing-value count for every variable is 0. Next, we will handle outliers.

Detecting Outliers with Boxplots

A boxplot is a graphical summary of a sample distribution that shows its shape (skewness), its central tendency and its dispersion. The following is a general view of a boxplot and how it represents outliers.

print('\nDistribution of Data Before the Outliers Being Handled: ')
print(df[['tenure','MonthlyCharges','TotalCharges']].describe())

# Creating Box Plot
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
sns.boxplot(x=df['tenure'])
plt.show()
plt.figure()
sns.boxplot(x=df['MonthlyCharges'])
plt.show()
plt.figure()
sns.boxplot(x=df['TotalCharges'])
plt.show()

Resolving the Outliers

Now that we know which variables have outliers, we handle them using the interquartile range (IQR) method.

Before :

print(df[['tenure','MonthlyCharges','TotalCharges']].describe())

After :

# Work only on the numeric columns so text columns are never compared with numbers
num_cols = ['tenure','MonthlyCharges','TotalCharges']
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1
maximum = Q3 + (1.5 * IQR)
print('Maximum Value from Each Variable is:')
print(maximum)
minimum = Q1 - (1.5 * IQR)
print('\nMinimum Value from Each Variable is:')
print(minimum)
# Replace values beyond the bounds with the bounds themselves
more_than = (df[num_cols] > maximum)
lower_than = (df[num_cols] < minimum)
df[num_cols] = df[num_cols].mask(more_than, maximum, axis=1)
df[num_cols] = df[num_cols].mask(lower_than, minimum, axis=1)
print('\nData Distribution after Outliers being Handled:')
print(df[num_cols].describe())

The three boxplots for 'tenure', 'MonthlyCharges' and 'TotalCharges' clearly show outliers: points that lie far from the box. The max row of the data distribution also shows very high values.

The outliers are then handled by replacing them with the upper and lower IQR bounds (Q3 + 1.5 * IQR and Q1 - 1.5 * IQR). After handling the outliers, the data distribution shows no more outlier values.
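The capping rule can be verified by hand on a small series. The sketch below uses invented values and the shorter clip( ) call, which is swapped in here for illustration and is not the method used in this article; it applies the same upper and lower bounds.

```python
import pandas as pd

# Hypothetical charges with one obvious outlier (100.0).
s = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 100.0], name="MonthlyCharges")

q1 = s.quantile(0.25)      # 13.5 with pandas' default linear interpolation
q3 = s.quantile(0.75)      # 20.5
iqr = q3 - q1              # 7.0

upper = q3 + 1.5 * iqr     # 31.0
lower = q1 - 1.5 * iqr     # 3.0

# clip replaces anything above `upper` or below `lower` with the bound itself.
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Only the value 100.0 falls outside the range from 3.0 to 31.0, so it is replaced by 31.0 and every other value is left unchanged.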

Detecting Non-Standard Values

Next, we check whether any categorical variables contain non-standard values. These usually result from data entry errors. Differing terms for the same category are a common cause, so the entered data needs to be standardized.

We will use the value_counts( ) function to see the count of each unique value per variable.

for col_name in ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']:
    print('\nUnique Values Count \033[1m' + 'Before Standardized \033[0m Variable', col_name)
    print(df[col_name].value_counts())
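When there are many columns, reading every value_counts( ) table by eye is error-prone. One possible aid, sketched here on invented values, is to compare the observed labels against a set of expected ones. The expected set is an assumption made for this example.

```python
import pandas as pd

# Hypothetical column containing one non-standard label.
toy = pd.Series(["Yes", "No", "Iya", "No", "Yes"], name="Dependents")

# Assumed canonical labels for this column.
expected = {"Yes", "No"}

counts = toy.value_counts()

# Any label that appears in the data but not in the expected set needs standardizing.
unexpected = sorted(set(counts.index) - expected)

print(counts)
print("Non-standard labels:", unexpected)
```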

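A note on the escape codes used in the print statements in this section: '\033[1m' starts bold text and '\033[0m' ends it. For example:

print("The bold text is", '\033[1m' + 'Python' + '\033[0m')

If '\033[0m' is not added, the following print statements will keep printing bold text.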

Standardizing Categorical Variables

Now that we know which variables have non-standard values, we standardize them to the most frequent pattern, provided that the meaning does not change. Example: Iya -> Yes. We use the replace() function to standardize the values, then look again at the unique values of each variable that was changed.

df = df.replace(['Wanita','Laki-laki','Churn','Iya'],['Female','Male','Yes','Yes'])
# Re-check the unique values of the variables that were changed
for col_name in ['gender', 'Dependents', 'Churn']:
    print('\nUnique Values Count \033[1m' + 'After Standardized \033[0mVariable', col_name)
    print(df[col_name].value_counts())

After standardizing the values and inspecting the data again, the unique values of each variable are now consistent.
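The replace( ) call in this section applies to every column of the DataFrame. If the same string could legitimately appear in another column, a column-scoped mapping is a safer variant. The sketch below uses invented rows, and the assignment of each non-standard label to a column is an assumption based on the three columns re-checked above.

```python
import pandas as pd

# Hypothetical rows mixing standard and non-standard labels.
toy = pd.DataFrame({
    "gender": ["Female", "Wanita", "Male", "Laki-laki"],
    "Dependents": ["Yes", "Iya", "No", "No"],
    "Churn": ["Yes", "Churn", "No", "No"],
})

# Nested dict: {column: {old_value: new_value}} limits each replacement to one column.
mappings = {
    "gender": {"Wanita": "Female", "Laki-laki": "Male"},
    "Dependents": {"Iya": "Yes"},
    "Churn": {"Churn": "Yes"},
}

standardized = toy.replace(mappings)
print(standardized)
```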

So, that's all for this article. I hope to keep writing consistently and to improve my code in the future. Feel free to contact me by e-mail if you spot something wrong or would like to collaborate on a project.

Thanks All!!!

Addition: We can use built-in ANSI escape sequences to print text in different formats such as bold, italic or colored. The ANSI escape sequence that starts bold text is '\033[1m', and '\033[0m' resets the formatting, as in the print statements of the standardization sections.
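To avoid forgetting the reset code, the two sequences can be wrapped in a small helper. The function name bold is a hypothetical choice for this sketch, and the effect is only visible in terminals that support ANSI escape sequences.

```python
BOLD = "\033[1m"   # ANSI sequence that switches bold on
RESET = "\033[0m"  # ANSI sequence that resets all formatting

def bold(text: str) -> str:
    """Wrap text in ANSI bold codes, always resetting afterwards."""
    return f"{BOLD}{text}{RESET}"

print("The bold text is", bold("Python"))
print("This line is printed normally again.")
```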

Reference : https://www.delftstack.com/howto/python/python-bold-text/
